Abstract. We study the optimal rates of convergence for estimating a
prior distribution over a VC class from a sequence of independent data
sets respectively labeled by independent target functions sampled from
the prior. We specifically derive upper and lower bounds on the optimal
rates under a smoothness condition on the correct prior, with the number
of samples per data set equal to the VC dimension. These results have
implications for the improvements achievable via transfer learning. We
additionally extend this setting to real-valued functions, where we
establish consistency of an estimator for the prior, and discuss an additional
application to a preference elicitation problem in algorithmic economics.


1 Introduction 

In the transfer learning setting, we are presented with a sequence of learning 
problems, each with some respective target concept we are tasked with learning. 
The key question in transfer learning is how to leverage our access to past
learning problems in order to improve performance on learning problems we will
be presented with in the future.

Among the several proposed models for transfer learning, one particularly
appealing model supposes the learning problems are independent and identically
distributed, with unknown distribution, and the advantage of transfer learning
then comes from the ability to estimate this shared distribution based on the
data from past learning problems. For instance, when customizing a speech
recognition system to a particular speaker's voice, we might expect the first few
people would need to speak many words or phrases in order for the system to
accurately identify the nuances. However, after performing this for many different
people, if the software has access to those past training sessions when
customizing itself to a new user, it should have identified important properties
of the speech patterns, such as the common patterns within each of the major
dialects or accents, and other such information about the distribution of speech
patterns within the user population. It should then be able to leverage this
information to reduce the number of words or phrases the next user needs to speak
in order to train the system, for instance by first trying to identify the
individual's dialect, then presenting phrases that differentiate common
subpatterns within that dialect, and so forth.

In analyzing the benefits of transfer learning in such a setting, one important
question to ask is how quickly we can estimate the distribution from which
the learning problems are sampled. In recent work, [12] have shown that under
mild conditions on the family of possible distributions, if the target concepts
reside in a known VC class, then it is possible to estimate this distribution using
only a bounded number of training samples per task: specifically, a number of
samples equal to the VC dimension. However, that work left open the question of
quantifying the rate of convergence. This rate of convergence can have a direct
impact on how much benefit we gain from transfer learning when we are faced
with only a finite sequence of learning problems. As such, it is certainly desirable
to derive tight characterizations of this rate of convergence.

The present work continues that of [12], bounding the rate of convergence for
estimating this distribution, under a smoothness condition on the distribution. 
We derive a generic upper bound, which holds regardless of the VC class the 
target concepts reside in. The proof of this result builds on that earlier work, but 
requires several interesting innovations to make the rate of convergence explicit, 
and to dramatically improve the upper bound implicit in the proofs of those 
earlier results. We further derive a nontrivial lower bound that holds for certain 
constructed scenarios, which illustrates a lower limit on how good of a general 
upper bound we might hope for in results expressed only in terms of the number 
of tasks, the smoothness conditions, and the VC dimension. 

We additionally include an extension of the results of [12] to the setting of
real-valued functions, establishing consistency (at a uniform rate) for an
estimator of a prior over any VC subgraph class. In addition to the application
to transfer learning, analogous to the original work of [12], we also discuss an
application of this result to a preference elicitation problem in algorithmic
economics, in which we are tasked with allocating items to a sequence of customers
to approximately maximize the customers' satisfaction, while permitted access
to the customer valuation functions only via value queries.


2 The Setting 

Let $(\mathcal{X}, \mathcal{B}_{\mathcal{X}})$ be a measurable space [8] (where
$\mathcal{X}$ is called the instance space), and let $\mathcal{D}$ be a
distribution on $\mathcal{X}$ (called the data distribution). Let $\mathbb{C}$ be
a VC class of measurable classifiers $h : \mathcal{X} \to \{-1,+1\}$ (called the
concept space), and denote by $d$ the VC dimension of $\mathbb{C}$. We suppose
$\mathbb{C}$ is equipped with its Borel $\sigma$-algebra $\mathcal{B}$ induced by
the pseudo-metric $\rho(h,g) = \mathcal{D}(\{x \in \mathcal{X} : h(x) \neq g(x)\})$.
Though our results can be formulated for general $\mathcal{D}$ (with somewhat
more complicated theorem statements), to simplify the statement of results we
suppose $\rho$ is actually a metric, which would follow from appropriate
topological conditions on $\mathbb{C}$ relative to $\mathcal{D}$.

For any two probability measures $\mu_1, \mu_2$ on a measurable space
$(\Omega, \mathcal{F})$, define the total variation distance

$$\|\mu_1 - \mu_2\| = \sup_{A \in \mathcal{F}} \mu_1(A) - \mu_2(A).$$
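On a finite space, the supremum in the total variation distance is attained at the event where the first measure exceeds the second, which gives half the L1 distance between the probability vectors. The following is a minimal sketch of that identity; the function names and the toy measures `mu1`, `mu2` are illustrative assumptions, not part of the paper.

```python
from itertools import chain, combinations

def total_variation(mu1, mu2):
    """Total variation distance on a finite space, via half the L1 distance.

    mu1 and mu2 are dicts mapping outcomes to probabilities.
    """
    support = set(mu1) | set(mu2)
    return 0.5 * sum(abs(mu1.get(w, 0.0) - mu2.get(w, 0.0)) for w in support)

def total_variation_bruteforce(mu1, mu2):
    """Same quantity, computed literally as sup_A mu1(A) - mu2(A) over all events A."""
    support = sorted(set(mu1) | set(mu2))
    events = chain.from_iterable(
        combinations(support, r) for r in range(len(support) + 1)
    )
    return max(sum(mu1.get(w, 0.0) - mu2.get(w, 0.0) for w in A) for A in events)

# Hypothetical toy measures on a three-point space.
mu1 = {"a": 0.5, "b": 0.3, "c": 0.2}
mu2 = {"a": 0.2, "b": 0.3, "c": 0.5}
tv_fast = total_variation(mu1, mu2)
tv_slow = total_variation_bruteforce(mu1, mu2)
```

The brute-force version enumerates all events and is only practical for very small spaces; it is included to make the definition concrete.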


For a set function $\mu$ on a finite measurable space $(\Omega, \mathcal{F})$, we
abbreviate $\mu(\omega) = \mu(\{\omega\})$, $\forall \omega \in \Omega$. Let
$\Pi_\Theta = \{\pi_\theta : \theta \in \Theta\}$ be a family of probability
measures on $\mathbb{C}$ (called priors), where $\Theta$ is an arbitrary index set
(called the parameter space). We suppose there exists a probability measure
$\pi_0$ on $\mathbb{C}$ (the reference measure) such that every $\pi_\theta$ is
absolutely continuous with respect to $\pi_0$, and therefore has a density
function $f_\theta$ given by the Radon-Nikodym derivative $d\pi_\theta / d\pi_0$ [5].

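We consider the following type of estimation problem. There is a collection
of $\mathbb{C}$-valued random variables $\{h^*_{t\theta} : t \in \mathbb{N},
\theta \in \Theta\}$, where for any fixed $\theta \in \Theta$ the
$\{h^*_{t\theta}\}_{t=1}^{\infty}$ variables are i.i.d. with distribution
$\pi_\theta$. For each $\theta \in \Theta$, there is a sequence
$Z^t(\theta) = \{(X_{t1}, Y_{t1}(\theta)), (X_{t2}, Y_{t2}(\theta)), \ldots\}$,
where $\{X_{ti}\}_{t,i \in \mathbb{N}}$ are i.i.d. $\mathcal{D}$, and for each
$t, i \in \mathbb{N}$, $Y_{ti}(\theta) = h^*_{t\theta}(X_{ti})$. We additionally
denote by $Z^t_k(\theta) = \{(X_{t1}, Y_{t1}(\theta)), \ldots, (X_{tk}, Y_{tk}(\theta))\}$
the first $k$ elements of $Z^t(\theta)$, for any $k \in \mathbb{N}$, and
similarly $\mathbb{X}_{tk} = \{X_{t1}, \ldots, X_{tk}\}$ and
$\mathbb{Y}_{tk}(\theta) = \{Y_{t1}(\theta), \ldots, Y_{tk}(\theta)\}$. Following the
terminology used in the transfer learning literature, we refer to the collection of
variables associated with each $t$ collectively as the $t^{\text{th}}$ task. We will
be concerned with sequences of estimators
$\hat\theta_{T\theta} = \hat\theta_T(Z^1_k(\theta), \ldots, Z^T_k(\theta))$, for
$T \in \mathbb{N}$, which are based on only a bounded number $k$ of samples per
task, among the first $T$ tasks. Our main results specifically study the case of
$d$ samples per task. For any such estimator, we measure the risk as
$\mathbb{E}\left[\|\pi_{\hat\theta_{T\theta_\star}} - \pi_{\theta_\star}\|\right]$,
and will be particularly interested in upper-bounding the worst-case risk
$\sup_{\theta_\star \in \Theta} \mathbb{E}\left[\|\pi_{\hat\theta_{T\theta_\star}} - \pi_{\theta_\star}\|\right]$
as a function of $T$, and lower-bounding the minimum possible value of this
worst-case risk over all possible $\hat\theta_T$ estimators (called the minimax risk).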
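The sampling model above can be sketched on a small finite problem: each task draws a target from the prior, then k instances from the data distribution, and records the labels. The function `sample_tasks`, the threshold class, and the uniform prior below are hypothetical choices made only for illustration.

```python
import random

def sample_tasks(prior, instances, T, k, rng):
    """Draw T tasks, each with k labeled points.

    prior: list of (classifier, probability) pairs; a classifier is a dict
           mapping each instance to a label in {-1, +1}.
    instances: list of instances; the data distribution is taken to be
               uniform over this list (an assumption of this sketch).
    Returns a list of T data sets, each a list of (x, y) pairs.
    """
    classifiers = [h for h, _ in prior]
    weights = [p for _, p in prior]
    tasks = []
    for _ in range(T):
        # Target concept for this task, sampled from the prior.
        h_star = rng.choices(classifiers, weights=weights, k=1)[0]
        # k i.i.d. instances from the data distribution.
        xs = [rng.choice(instances) for _ in range(k)]
        tasks.append([(x, h_star[x]) for x in xs])
    return tasks

# Hypothetical toy problem: thresholds on {0, 1, 2, 3}, a class of VC dimension 1,
# so each task receives d = 1 labeled point.
instances = [0, 1, 2, 3]
thresholds = [{x: (+1 if x >= j else -1) for x in instances} for j in range(5)]
prior = [(h, 0.2) for h in thresholds]
tasks = sample_tasks(prior, instances, T=100, k=1, rng=random.Random(0))
```

An estimator in the sense of the text is then any function of `tasks` that returns a guess for the prior.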

In previous work, [12] showed that, if $\Pi_\Theta$ is a totally bounded family,
then even with only $d$ samples per task, the minimax risk (as a function
of the number of tasks $T$) converges to zero. In fact, that work also proved
this is not necessarily the case in general for any number of samples less than
$d$. However, the actual rates of convergence were not explicitly derived in that
work, and indeed the upper bounds on the rates of convergence implicit in that
analysis may often have fairly complicated dependences on $\mathbb{C}$,
$\Pi_\Theta$, and $\mathcal{D}$, and furthermore often provide only very slow
rates of convergence.

To derive explicit bounds on the rates of convergence, in the present work we
specifically focus on families of smooth densities. The motivation for involving a
notion of smoothness in characterizing rates of convergence is clear if we consider
the extreme case in which $\Pi_\Theta$ contains two priors $\pi_1$ and $\pi_2$,
with $\pi_1(\{h\}) = \pi_2(\{g\}) = 1$, where $\rho(h,g)$ is a very small but
nonzero value; in this case, if we have only a small number of samples per task,
we would require many tasks (on the order of $1/\rho(h,g)$) to observe any data
points carrying any information that would distinguish between these two priors
(namely, points $x$ with $h(x) \neq g(x)$);








Liu Yang, Steve Hanneke, and Jaime Carbonell 


yet $\|\pi_1 - \pi_2\| = 1$, so that we have a slow rate of convergence (at least
initially). A total boundedness condition on $\Pi_\Theta$ would limit the number
of such pairs present in $\Pi_\Theta$, so that for instance we cannot have
arbitrarily close $h$ and $g$, but less extreme variants of this can lead to slow
asymptotic rates of convergence as well. Specifically, in the present work we
consider the following notion of smoothness. For $L \in (0,\infty)$ and
$\alpha \in (0,1]$, a function $f : \mathbb{C} \to \mathbb{R}$ is
$(L,\alpha)$-Hölder smooth if

$$\forall h, g \in \mathbb{C},\quad |f(h) - f(g)| \le L \rho(h,g)^{\alpha}.$$
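For a finite concept class, the Hölder condition can be checked pair by pair. The sketch below assumes a uniform data distribution on a finite instance space, so that the distance between two classifiers is their fraction of disagreements; the class of thresholds and the density values in `f` are made up for illustration.

```python
def rho(h, g):
    """Disagreement probability of h and g under the uniform distribution
    on a finite instance space; h and g are tuples of labels."""
    return sum(1 for a, b in zip(h, g) if a != b) / len(h)

def is_holder_smooth(f, concepts, L, alpha):
    """Check |f(h) - f(g)| <= L * rho(h, g)**alpha for every pair of concepts."""
    return all(
        abs(f[h] - f[g]) <= L * rho(h, g) ** alpha
        for h in concepts
        for g in concepts
    )

# Hypothetical example: thresholds on 4 points and a made-up density f
# (its average under the uniform reference measure is 1).
concepts = [tuple(+1 if x >= j else -1 for x in range(4)) for j in range(5)]
f = dict(zip(concepts, [1.0, 1.1, 1.0, 0.9, 1.0]))
smooth_with_half = is_holder_smooth(f, concepts, L=0.5, alpha=1.0)
smooth_with_fifth = is_holder_smooth(f, concepts, L=0.2, alpha=1.0)
```

Neighbouring thresholds are at distance 1/4 and their density values differ by 0.1, so the condition holds with L = 0.5 but fails with L = 0.2.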

3 An Upper Bound 

We now have the following theorem, holding for an arbitrary VC class C and 
data distribution V; it is the main result of this work. 

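Theorem 1. For $\Pi_\Theta$ any class of priors on $\mathbb{C}$ having
$(L,\alpha)$-Hölder smooth densities $\{f_\theta : \theta \in \Theta\}$, for any
$T \in \mathbb{N}$, there exists an estimator
$\hat\theta_{T\theta} = \hat\theta_T(Z^1_d(\theta), \ldots, Z^T_d(\theta))$ such that

$$\sup_{\theta_\star \in \Theta} \mathbb{E}\|\pi_{\hat\theta_{T\theta_\star}} - \pi_{\theta_\star}\| = \tilde{O}\left( L T^{-\frac{\alpha^2}{2(d+2\alpha)(\alpha+2(d+1))}} \right).$$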

Proof. By the standard PAC analysis [913], for any $\gamma > 0$, with probability
greater than $1 - \gamma$, a sample of $k = O((d/\gamma)\log(1/\gamma))$ random
points will partition $\mathbb{C}$ into regions of width less than $\gamma$ (under
$L_1(\mathcal{D})$). For brevity, we omit the $t$ subscripts and superscripts on
quantities such as $Z^t_k(\theta)$ throughout the following analysis, since the
claims hold for any arbitrary value of $t$.

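For any $\theta \in \Theta$, let $\pi'_\theta$ denote a (conditional on
$X_1, \ldots, X_k$) distribution defined as follows. Let $f'_\theta$ denote the
(conditional on $X_1, \ldots, X_k$) density function of $\pi'_\theta$ with respect
to $\pi_0$, and for any $g \in \mathbb{C}$, let

$$f'_\theta(g) = \frac{\pi_\theta(\{h \in \mathbb{C} : \forall i \le k, h(X_i) = g(X_i)\})}{\pi_0(\{h \in \mathbb{C} : \forall i \le k, h(X_i) = g(X_i)\})}$$

(or $0$ if $\pi_0(\{h \in \mathbb{C} : \forall i \le k, h(X_i) = g(X_i)\}) = 0$).
In other words, $\pi'_\theta$ has the same probability mass as $\pi_\theta$ for
each of the equivalence classes induced by $X_1, \ldots, X_k$, but conditioned on
the equivalence class, simply has a constant-density distribution over that
equivalence class. Note that every $h \in \mathbb{C}$ has $f'_\theta(h)$ between
the smallest and largest values of $f_\theta(g)$ among $g \in \mathbb{C}$ with
$\forall i \le k, g(X_i) = h(X_i)$; therefore, by the smoothness condition, on the
event (of probability greater than $1 - \gamma$) that each of these regions has
diameter less than $\gamma$, we have
$\forall h \in \mathbb{C}, |f_\theta(h) - f'_\theta(h)| < L\gamma^{\alpha}$. On
this event, for any $\theta, \theta' \in \Theta$,

$$\|\pi_\theta - \pi_{\theta'}\| = (1/2)\int |f_\theta - f_{\theta'}| \, d\pi_0 < L\gamma^{\alpha} + (1/2)\int |f'_\theta - f'_{\theta'}| \, d\pi_0.$$

Furthermore, since the regions that define $f'_\theta$ and $f'_{\theta'}$ are the
same (namely, the partition induced by $X_1, \ldots, X_k$), we have

$$(1/2)\int |f'_\theta - f'_{\theta'}| \, d\pi_0 = (1/2) \sum_{y_1,\ldots,y_k \in \{-1,+1\}} \left| \pi_\theta(\{h \in \mathbb{C} : \forall i \le k, h(X_i) = y_i\}) - \pi_{\theta'}(\{h \in \mathbb{C} : \forall i \le k, h(X_i) = y_i\}) \right| = \|\mathbb{P}_{\mathbb{Y}_k(\theta)|\mathbb{X}_k} - \mathbb{P}_{\mathbb{Y}_k(\theta')|\mathbb{X}_k}\|.$$

Thus, we have that with probability at least $1 - \gamma$,

$$\|\pi_\theta - \pi_{\theta'}\| < L\gamma^{\alpha} + \|\mathbb{P}_{\mathbb{Y}_k(\theta)|\mathbb{X}_k} - \mathbb{P}_{\mathbb{Y}_k(\theta')|\mathbb{X}_k}\|.$$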
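The piecewise-constant surrogate used in this step replaces the density on each equivalence class (the set of concepts that agree on the sample) by the ratio of the class masses under the prior and under the reference measure. A minimal sketch on a finite class follows; `project_density`, the threshold class, and the two mass vectors are hypothetical and serve only to illustrate the construction.

```python
from collections import defaultdict

def project_density(concepts, pi_theta, pi_0, sample):
    """Return the projected density values, one per concept.

    For each concept g, the value is pi_theta(cell(g)) / pi_0(cell(g)), where
    cell(g) is the set of concepts agreeing with g on every point in `sample`
    (or 0 if the cell has zero reference mass).
    """
    mass_theta = defaultdict(float)
    mass_0 = defaultdict(float)
    for h, p, p0 in zip(concepts, pi_theta, pi_0):
        key = tuple(h[x] for x in sample)  # the labeling of the sample identifies the cell
        mass_theta[key] += p
        mass_0[key] += p0
    projected = []
    for h in concepts:
        key = tuple(h[x] for x in sample)
        projected.append(mass_theta[key] / mass_0[key] if mass_0[key] > 0 else 0.0)
    return projected

# Hypothetical example: thresholds on 4 points, uniform reference measure.
concepts = [tuple(+1 if x >= j else -1 for x in range(4)) for j in range(5)]
pi_0 = [0.2] * 5
pi_theta = [0.1, 0.3, 0.2, 0.2, 0.2]
f = [p / p0 for p, p0 in zip(pi_theta, pi_0)]
f_prime = project_density(concepts, pi_theta, pi_0, sample=[2])
```

The projected density still integrates to one against the reference measure, and on each cell it lies between the smallest and largest original density values, which is the property the smoothness argument relies on.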

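Following an argument analogous to the inductive argument of [12], suppose
$I \subseteq \{1, \ldots, k\}$, fix $\bar{x}_I \in \mathcal{X}^{|I|}$ and
$\bar{y}_I \in \{-1,+1\}^{|I|}$. Then the $\tilde{y}_I \in \{-1,+1\}^{|I|}$ for
which $\|\bar{y}_I - \tilde{y}_I\|_1$ is minimal, subject to the constraint that
no $h \in \mathbb{C}$ has $h(\bar{x}_I) = \tilde{y}_I$, has
$(1/2)\|\bar{y}_I - \tilde{y}_I\|_1 \le d + 1$; also, for any $i \in I$ with
$\bar{y}_i \neq \tilde{y}_i$, letting $\bar{y}'_j = \bar{y}_j$ for
$j \in I \setminus \{i\}$ and $\bar{y}'_i = \tilde{y}_i$, we have

$$\mathbb{P}_{\mathbb{Y}_I(\theta)|\mathbb{X}_I}(\bar{y}_I | \bar{x}_I) = \mathbb{P}_{\mathbb{Y}_{I \setminus \{i\}}(\theta)|\mathbb{X}_{I \setminus \{i\}}}(\bar{y}_{I \setminus \{i\}} | \bar{x}_{I \setminus \{i\}}) - \mathbb{P}_{\mathbb{Y}_I(\theta)|\mathbb{X}_I}(\bar{y}'_I | \bar{x}_I),$$

and similarly for $\theta'$, so that

$$\left| \mathbb{P}_{\mathbb{Y}_I(\theta)|\mathbb{X}_I}(\bar{y}_I | \bar{x}_I) - \mathbb{P}_{\mathbb{Y}_I(\theta')|\mathbb{X}_I}(\bar{y}_I | \bar{x}_I) \right| \le \left| \mathbb{P}_{\mathbb{Y}_{I \setminus \{i\}}(\theta)|\mathbb{X}_{I \setminus \{i\}}}(\bar{y}_{I \setminus \{i\}} | \bar{x}_{I \setminus \{i\}}) - \mathbb{P}_{\mathbb{Y}_{I \setminus \{i\}}(\theta')|\mathbb{X}_{I \setminus \{i\}}}(\bar{y}_{I \setminus \{i\}} | \bar{x}_{I \setminus \{i\}}) \right| + \left| \mathbb{P}_{\mathbb{Y}_I(\theta)|\mathbb{X}_I}(\bar{y}'_I | \bar{x}_I) - \mathbb{P}_{\mathbb{Y}_I(\theta')|\mathbb{X}_I}(\bar{y}'_I | \bar{x}_I) \right|.$$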

Now consider that these two terms inductively define a binary tree. Every time
the tree branches left once, it arrives at a difference of probabilities for a set
$I$ of one less element than that of its parent. Every time the tree branches
right once, it arrives at a difference of probabilities for a $\bar{y}_I$ one
closer to an unrealized $\tilde{y}_I$ than that of its parent. Say we stop
branching the tree upon reaching a set $I$ and a $\bar{y}_I$ such that either
$\bar{y}_I$ is an unrealized labeling, or $|I| = d$. Thus, we can bound the
original (root node) difference of probabilities by the sum of the differences of
probabilities for the leaf nodes with $|I| = d$. Any path in the tree can branch
left at most $k - d$ times (total) before reaching a set $I$ with only $d$
elements, and can branch right at most $d + 1$ times in a row before reaching a
$\bar{y}_I$ such that both probabilities are zero, so that the difference is zero.
So the depth of any leaf node with $|I| = d$ is at most $(k - d)d$. Furthermore,
at any level of the tree, from left to right the nodes have strictly decreasing
$|I|$ values, so that the maximum width of the tree is at most $k - d$. So the
total number of leaf nodes with $|I| = d$ is at most $(k - d)^2 d$. Thus, for any
$\bar{y} \in \{-1,+1\}^k$ and $\bar{x} \in \mathcal{X}^k$,

$$\left| \mathbb{P}_{\mathbb{Y}_k(\theta)|\mathbb{X}_k}(\bar{y} | \bar{x}) - \mathbb{P}_{\mathbb{Y}_k(\theta')|\mathbb{X}_k}(\bar{y} | \bar{x}) \right| \le (k - d)^2 d \cdot \max_{\bar{y}^d \in \{-1,+1\}^d} \max_{D \in \{1,\ldots,k\}^d} \left| \mathbb{P}_{\mathbb{Y}_d(\theta)|\mathbb{X}_d}(\bar{y}^d | \bar{x}_D) - \mathbb{P}_{\mathbb{Y}_d(\theta')|\mathbb{X}_d}(\bar{y}^d | \bar{x}_D) \right|.$$
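Since

$$\|\mathbb{P}_{\mathbb{Y}_k(\theta)|\mathbb{X}_k} - \mathbb{P}_{\mathbb{Y}_k(\theta')|\mathbb{X}_k}\| = (1/2) \sum_{\bar{y}^k \in \{-1,+1\}^k} \left| \mathbb{P}_{\mathbb{Y}_k(\theta)|\mathbb{X}_k}(\bar{y}^k) - \mathbb{P}_{\mathbb{Y}_k(\theta')|\mathbb{X}_k}(\bar{y}^k) \right|,$$

and by Sauer's Lemma this is at most

$$(ek)^d \max_{\bar{y}^k \in \{-1,+1\}^k} \left| \mathbb{P}_{\mathbb{Y}_k(\theta)|\mathbb{X}_k}(\bar{y}^k) - \mathbb{P}_{\mathbb{Y}_k(\theta')|\mathbb{X}_k}(\bar{y}^k) \right|,$$

we have that

$$\|\mathbb{P}_{\mathbb{Y}_k(\theta)|\mathbb{X}_k} - \mathbb{P}_{\mathbb{Y}_k(\theta')|\mathbb{X}_k}\| \le (ek)^d k^2 d \max_{\bar{y}^d \in \{-1,+1\}^d} \max_{D \in \{1,\ldots,k\}^d} \left| \mathbb{P}_{\mathbb{Y}_D(\theta)|\mathbb{X}_D}(\bar{y}^d) - \mathbb{P}_{\mathbb{Y}_D(\theta')|\mathbb{X}_D}(\bar{y}^d) \right|.$$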
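Thus, we have that

$$\|\pi_\theta - \pi_{\theta'}\| = \mathbb{E}\|\pi_\theta - \pi_{\theta'}\| < \gamma + L\gamma^{\alpha} + (ek)^d k^2 d \, \mathbb{E}\left[ \max_{\bar{y}^d \in \{-1,+1\}^d} \max_{D \in \{1,\ldots,k\}^d} \left| \mathbb{P}_{\mathbb{Y}_D(\theta)|\mathbb{X}_D}(\bar{y}^d) - \mathbb{P}_{\mathbb{Y}_D(\theta')|\mathbb{X}_D}(\bar{y}^d) \right| \right].$$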


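Note that

$$\mathbb{E}\left[ \max_{\bar{y}^d \in \{-1,+1\}^d} \max_{D \in \{1,\ldots,k\}^d} \left| \mathbb{P}_{\mathbb{Y}_D(\theta)|\mathbb{X}_D}(\bar{y}^d) - \mathbb{P}_{\mathbb{Y}_D(\theta')|\mathbb{X}_D}(\bar{y}^d) \right| \right] \le \sum_{\bar{y}^d \in \{-1,+1\}^d} \sum_{D \in \{1,\ldots,k\}^d} \mathbb{E}\left[ \left| \mathbb{P}_{\mathbb{Y}_D(\theta)|\mathbb{X}_D}(\bar{y}^d) - \mathbb{P}_{\mathbb{Y}_D(\theta')|\mathbb{X}_D}(\bar{y}^d) \right| \right] \le (2k)^d \max_{\bar{y}^d \in \{-1,+1\}^d} \max_{D \in \{1,\ldots,k\}^d} \mathbb{E}\left[ \left| \mathbb{P}_{\mathbb{Y}_D(\theta)|\mathbb{X}_D}(\bar{y}^d) - \mathbb{P}_{\mathbb{Y}_D(\theta')|\mathbb{X}_D}(\bar{y}^d) \right| \right],$$

and by exchangeability, this last line equals

$$(2k)^d \max_{\bar{y}^d \in \{-1,+1\}^d} \mathbb{E}\left[ \left| \mathbb{P}_{\mathbb{Y}_d(\theta)|\mathbb{X}_d}(\bar{y}^d) - \mathbb{P}_{\mathbb{Y}_d(\theta')|\mathbb{X}_d}(\bar{y}^d) \right| \right].$$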
?/“€{ — 1 , + 1 }“ 


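[12] showed that
$\mathbb{E}\left[ \left| \mathbb{P}_{\mathbb{Y}_d(\theta)|\mathbb{X}_d}(\bar{y}^d) - \mathbb{P}_{\mathbb{Y}_d(\theta')|\mathbb{X}_d}(\bar{y}^d) \right| \right] \le 4\sqrt{\|\mathbb{P}_{Z_d(\theta)} - \mathbb{P}_{Z_d(\theta')}\|}$,
so that in total we have
$\|\pi_\theta - \pi_{\theta'}\| < (L+1)\gamma^{\alpha} + 4(2ek)^{2d+2}\sqrt{\|\mathbb{P}_{Z_d(\theta)} - \mathbb{P}_{Z_d(\theta')}\|}$.
Plugging in the value of $k = c(d/\gamma)\log(1/\gamma)$, this is

$$(L+1)\gamma^{\alpha} + 4\left( 2ec\frac{d}{\gamma}\log\left(\frac{1}{\gamma}\right) \right)^{2d+2} \sqrt{\|\mathbb{P}_{Z_d(\theta)} - \mathbb{P}_{Z_d(\theta')}\|}.$$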

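Thus, it suffices to bound the rate of convergence (in total variation distance)
of some estimator of $\mathbb{P}_{Z_d(\theta_\star)}$. If $N(\varepsilon)$ is the
$\varepsilon$-covering number of $\{\mathbb{P}_{Z_d(\theta)} : \theta \in \Theta\}$,
then taking $\hat\theta_{T\theta_\star}$ as the minimum distance skeleton estimate
of [EE] achieves expected total variation distance $\varepsilon$ from
$\mathbb{P}_{Z_d(\theta_\star)}$, for some
$T = O((1/\varepsilon^2)\log N(\varepsilon/4))$. We can partition $\mathbb{C}$
into $O((L/\varepsilon)^{d/\alpha})$ cells of diameter
$O((\varepsilon/L)^{1/\alpha})$, and set a constant density value within each
cell, on an $O(\varepsilon)$-grid of density values, and every prior with
$(L,\alpha)$-Hölder smooth density will have density within $\varepsilon$ of some
density so-constructed; there are then at most
$(1/\varepsilon)^{O((L/\varepsilon)^{d/\alpha})}$ such densities, so this bounds
the covering numbers of $\Pi_\Theta$. Furthermore, the covering number of
$\Pi_\Theta$ upper bounds $N(\varepsilon)$ [12], so that
$N(\varepsilon) \le (1/\varepsilon)^{O((L/\varepsilon)^{d/\alpha})}$.

Solving $T = O(\varepsilon^{-2}(L/\varepsilon)^{d/\alpha}\log(1/\varepsilon))$ for
$\varepsilon$, we have
$\varepsilon = O\left( \left( \frac{L^{d/\alpha}\log(TL)}{T} \right)^{\frac{\alpha}{d+2\alpha}} \right)$.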
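So this bounds the rate of convergence for
$\mathbb{E}\|\mathbb{P}_{Z_d(\hat\theta_{T\theta_\star})} - \mathbb{P}_{Z_d(\theta_\star)}\|$,
for the minimum distance skeleton estimate. Plugging this rate into the bound on
the priors, combined with Jensen's inequality, we have

$$\mathbb{E}\|\pi_{\hat\theta_{T\theta_\star}} - \pi_{\theta_\star}\| \le (L+1)\gamma^{\alpha} + 4\left( 2ec\frac{d}{\gamma}\log\left(\frac{1}{\gamma}\right) \right)^{2d+2} \times O\left( \left( \frac{L^{d/\alpha}\log(TL)}{T} \right)^{\frac{\alpha}{2d+4\alpha}} \right).$$

This holds for any $\gamma > 0$, so minimizing this expression over $\gamma > 0$
yields a bound on the rate. For instance, with
$\gamma = \tilde{O}\left( T^{-\frac{\alpha}{2(d+2\alpha)(\alpha+2(d+1))}} \right)$,
we have

$$\mathbb{E}\|\pi_{\hat\theta_{T\theta_\star}} - \pi_{\theta_\star}\| = \tilde{O}\left( L T^{-\frac{\alpha^2}{2(d+2\alpha)(\alpha+2(d+1))}} \right).$$

□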
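The exponent bookkeeping at the end of this proof can be checked with exact rational arithmetic. The sketch below assumes the chain of steps as read here: the skeleton estimate converges at rate T^(-alpha/(d+2 alpha)), a square root halves that exponent, and gamma is then chosen to balance gamma^alpha against gamma^(-(2d+2)) times the halved rate, ignoring logarithmic factors. The function name is hypothetical.

```python
from fractions import Fraction

def upper_bound_exponents(d, alpha):
    """Return (eps_exp, gamma_exp, rate_exp), each the exponent e in T**(-e).

    eps_exp:   rate of the density estimate of the d-sample distribution
    gamma_exp: choice of gamma that balances the two terms of the bound
    rate_exp:  resulting rate for the prior, gamma**alpha
    Logarithmic factors are ignored (an assumption of this sketch).
    """
    alpha = Fraction(alpha)
    eps_exp = alpha / (d + 2 * alpha)
    root_exp = eps_exp / 2                          # after taking the square root
    gamma_exp = root_exp / (alpha + 2 * (d + 1))    # gamma**(alpha + 2d + 2) = T**(-root_exp)
    rate_exp = alpha * gamma_exp
    return eps_exp, gamma_exp, rate_exp

def closed_form(d, alpha):
    """Exponent alpha^2 / (2 (d + 2 alpha) (alpha + 2 (d + 1)))."""
    alpha = Fraction(alpha)
    return alpha ** 2 / (2 * (d + 2 * alpha) * (alpha + 2 * (d + 1)))

exps_d1 = upper_bound_exponents(1, 1)
exps_d3 = upper_bound_exponents(3, Fraction(1, 2))
```

For d = 1 and alpha = 1 this gives a rate exponent of 1/30, which illustrates how slowly the general upper bound decays even in the simplest case.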


4 A Minimax Lower Bound 

One natural question is whether Theorem 1 can generally be improved. While
we expect this to be true for some fixed VC classes (e.g., those of finite size),
and in any case we expect that some of the constant factors in the exponent
may be improvable, it is not at this time clear whether the general form of
$T^{-\Theta(\alpha/(d+\alpha))}$ is sometimes optimal. One way to investigate this
question is to construct specific spaces $\mathbb{C}$ and distributions
$\mathcal{D}$ for which a lower bound can be obtained. In particular, we are
generally interested in exhibiting lower bounds that are worse than those that
apply to the usual problem of density estimation based on direct access to the
$h^*_{t\theta_\star}$ values (see Theorem 3 below).

Here we present a lower bound that is interesting for this reason. However, 
although larger than the optimal rate for methods with direct access to the 
target concepts, it is still far from matching the upper bound above, so that the 
question of tightness remains open. Specifically, we have the following result. 

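Theorem 2. For any integer $d \ge 1$, any $L > 0$, $\alpha \in (0,1]$, there is a
value $C(d,L,\alpha) \in (0,\infty)$ such that, for any $T \in \mathbb{N}$, there
exists an instance space $\mathcal{X}$, a concept space $\mathbb{C}$ of VC
dimension $d$, a distribution $\mathcal{D}$ over $\mathcal{X}$, and a distribution
$\pi_0$ over $\mathbb{C}$ such that, for $\Pi_\Theta$ a set of distributions over
$\mathbb{C}$ with $(L,\alpha)$-Hölder smooth density functions with respect to
$\pi_0$, any estimator
$\hat\theta_T = \hat\theta_T(Z^1_d(\theta_\star), \ldots, Z^T_d(\theta_\star))$ has

$$\sup_{\theta_\star \in \Theta} \mathbb{E}\left[ \|\pi_{\hat\theta_T} - \pi_{\theta_\star}\| \right] \ge C(d,L,\alpha) \, T^{-\frac{\alpha}{2(d+\alpha)}}.$$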


Proof. (Sketch) We proceed by a reduction from the task of determining the bias
of a coin from among two given possibilities. Specifically, fix any
$\gamma \in (0,1/2)$, $n \in \mathbb{N}$, and let $B_1(p), \ldots, B_n(p)$ be
i.i.d. Bernoulli($p$) random variables, for each $p \in [0,1]$; then it is known
that, for any (possibly nondeterministic) decision rule
$\hat{p}_n : \{0,1\}^n \to \{(1+\gamma)/2, (1-\gamma)/2\}$,

$$\frac{1}{2} \sum_{p \in \{(1+\gamma)/2, (1-\gamma)/2\}} \mathbb{P}\left( \hat{p}_n(B_1(p), \ldots, B_n(p)) \neq p \right) \ge (1/32) \cdot \exp\{-128\gamma^2 n/3\}. \qquad (1)$$

This easily follows from the results of [T] , combined with a result of [7] bounding 
the KL divergence (see also m) 
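As a sanity check on inequality (1), one can compare its right-hand side with the exact error of a specific decision rule. The sketch below uses the majority-vote rule for odd n, whose error is the same under both biases by symmetry; the function names are hypothetical, and the comparison only illustrates the inequality for one rule rather than proving it.

```python
from math import comb, exp

def majority_error(n, gamma):
    """Exact error probability of majority vote for odd n when p = (1 + gamma) / 2.

    The rule errs when at most (n - 1) / 2 of the n flips are heads.
    """
    p = (1 + gamma) / 2
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range((n + 1) // 2))

def coin_lower_bound(n, gamma):
    """Right-hand side of inequality (1)."""
    return (1 / 32) * exp(-128 * gamma**2 * n / 3)

err = majority_error(11, 0.2)
bound = coin_lower_bound(11, 0.2)
```

With these constants the lower bound is far from the majority-vote error for small n; its role in the proof is only to give the correct exponential dependence on the product of gamma squared and n.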

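To use this result, we construct a learning problem as follows. Fix some
$m \in \mathbb{N}$ with $m \ge d$, let $\mathcal{X} = \{1, \ldots, m\}$, and let
$\mathbb{C}$ be the space of all classifiers $h : \mathcal{X} \to \{-1,+1\}$ such
that $|\{x \in \mathcal{X} : h(x) = +1\}| \le d$. Clearly the VC dimension of
$\mathbb{C}$ is $d$. Define the distribution $\mathcal{D}$ as uniform over
$\mathcal{X}$. Finally, we specify a family of $(L,\alpha)$-Hölder smooth priors,
parameterized by $\Theta = \{-1,+1\}^{\binom{m}{d}}$, as follows. Let
$\gamma_m = (L/2)(1/m)^{\alpha}$. First, enumerate the $\binom{m}{d}$ distinct
$d$-sized subsets of $\{1, \ldots, m\}$ as
$\mathcal{X}_1, \mathcal{X}_2, \ldots, \mathcal{X}_{\binom{m}{d}}$. Define the
reference distribution $\pi_0$ by the property that, for any $h \in \mathbb{C}$,
letting $q = |\{x : h(x) = +1\}|$,
$\pi_0(\{h\}) = \left(\frac{1}{2}\right)^d \binom{m-q}{d-q} / \binom{m}{d}$.
For any $\mathbf{b} = (b_1, \ldots, b_{\binom{m}{d}}) \in \{-1,+1\}^{\binom{m}{d}}$,
define the prior $\pi_{\mathbf{b}}$ as the distribution of a random variable
$h_{\mathbf{b}}$ specified by the following generative model. Let
$i^* \sim \mathrm{Uniform}(\{1, \ldots, \binom{m}{d}\})$, let
$C_{\mathbf{b}}(i^*) \sim \mathrm{Bernoulli}((1 + \gamma_m b_{i^*})/2)$; finally,
$h_{\mathbf{b}} \sim \mathrm{Uniform}(\{h \in \mathbb{C} : \{x : h(x) = +1\} \subseteq \mathcal{X}_{i^*}, \mathrm{Parity}(|\{x : h(x) = +1\}|) = C_{\mathbf{b}}(i^*)\})$,
where $\mathrm{Parity}(n)$ is $1$ if $n$ is odd, or $0$ if $n$ is even. We will
refer to the variables in this generative model below. For any
$h \in \mathbb{C}$, letting $H = \{x : h(x) = +1\}$ and $q = |H|$, we can
equivalently express

$$\pi_{\mathbf{b}}(\{h\}) = \left(\frac{1}{2}\right)^d \binom{m}{d}^{-1} \sum_{i=1}^{\binom{m}{d}} \mathbb{1}[H \subseteq \mathcal{X}_i] \, (1 + \gamma_m b_i)^{\mathrm{Parity}(q)} (1 - \gamma_m b_i)^{1 - \mathrm{Parity}(q)}.$$

From this explicit representation, it is clear that, letting
$f_{\mathbf{b}} = d\pi_{\mathbf{b}} / d\pi_0$, we have
$f_{\mathbf{b}}(h) \in [1 - \gamma_m, 1 + \gamma_m]$ for all $h \in \mathbb{C}$.
The fact that $f_{\mathbf{b}}$ is Hölder smooth follows from this, since every
distinct $h, g \in \mathbb{C}$ have
$\mathcal{D}(\{x : h(x) \neq g(x)\}) \ge 1/m = (2\gamma_m/L)^{1/\alpha}$.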
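The construction can be checked numerically on a small instance. The sketch below computes the point masses implied by the generative model (uniform choice of a d-subset, a biased parity bit, then a uniform classifier with that parity supported inside the subset) and verifies that the density with respect to the reference measure stays within gamma_m of one. The function `prior_pi_b`, the identification of a classifier with its set of positive points, and the particular sign vector `b` are assumptions made for illustration.

```python
from itertools import combinations

def prior_pi_b(m, d, L, alpha, b):
    """Point masses of pi_b and pi_0 for every classifier with at most d positives.

    A classifier is identified with the tuple H of its positive points.
    b is a list of +1/-1 signs, one per d-subset in combinations() order.
    """
    gamma_m = (L / 2) * (1 / m) ** alpha
    subsets = [set(s) for s in combinations(range(1, m + 1), d)]
    norm = 2**d * len(subsets)
    pi_b, pi_0 = {}, {}
    for q in range(d + 1):
        sign = 1 if q % 2 == 1 else -1  # odd parity has weight 1 + gamma_m * b_i, even 1 - gamma_m * b_i
        for H in combinations(range(1, m + 1), q):
            containing = [i for i, Xi in enumerate(subsets) if set(H) <= Xi]
            pi_b[H] = sum(1 + sign * gamma_m * b[i] for i in containing) / norm
            pi_0[H] = len(containing) / norm
    return pi_b, pi_0, gamma_m

# Hypothetical small instance: m = 4 points, d = 2, so there are 6 subsets.
b = [1, -1, 1, 1, -1, -1]
pi_b, pi_0, gamma_m = prior_pi_b(4, 2, 1.0, 1.0, b)
density = {H: pi_b[H] / pi_0[H] for H in pi_b}
```

Both measures sum to one, and every density value lies in the interval from 1 - gamma_m to 1 + gamma_m, as the text asserts.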

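Next we set up the reduction as follows. For any estimator
$\hat\pi_T = \hat\pi_T(Z^1_d(\theta_\star), \ldots, Z^T_d(\theta_\star))$, and each
$i \in \{1, \ldots, \binom{m}{d}\}$, let $h_i$ be the classifier with
$\{x : h_i(x) = +1\} = \mathcal{X}_i$; also, if
$\hat\pi_T(\{h_i\}) > \left(\frac{1}{2}\right)^d / \binom{m}{d}$, let
$\hat{b}_i = 2\mathrm{Parity}(d) - 1$, and otherwise
$\hat{b}_i = 1 - 2\mathrm{Parity}(d)$. We use these $\hat{b}_i$ values to estimate
the original $b_i$ values. Specifically, let $\hat{p}_i = (1 + \gamma_m \hat{b}_i)/2$
and $p_i = (1 + \gamma_m b_i)/2$, where $\mathbf{b} = \theta_\star$. Then

$$\|\hat\pi_T - \pi_{\theta_\star}\| \ge (1/2) \sum_{i=1}^{\binom{m}{d}} \left| \hat\pi_T(\{h_i\}) - \pi_{\theta_\star}(\{h_i\}) \right| \ge (1/2) \sum_{i=1}^{\binom{m}{d}} \frac{\gamma_m}{2^d \binom{m}{d}} |\hat{b}_i - b_i|/2 = (1/2) \sum_{i=1}^{\binom{m}{d}} \frac{1}{2^d \binom{m}{d}} |\hat{p}_i - p_i|.$$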

Thus, we have reduced from the problem of deciding the biases of these
$\binom{m}{d}$ independent Bernoulli random variables. To complete the proof, it
suffices to lower bound the expectation of the right side for an arbitrary
estimator.
Toward this end, we in fact study an even easier problem. Specifically,
consider an estimator
$\hat{q}_i = \hat{q}_i(Z^1_d(\theta_\star), \ldots, Z^T_d(\theta_\star), i^*_1, \ldots, i^*_T)$,
where $i^*_t$ is the $i^*$ random variable in the generative model that defines
$h^*_{t\theta_\star}$; that is,
$i^*_t \sim \mathrm{Uniform}(\{1, \ldots, \binom{m}{d}\})$,
$C_t \sim \mathrm{Bernoulli}((1 + \gamma_m b_{i^*_t})/2)$, and
$h^*_{t\theta_\star} \sim \mathrm{Uniform}(\{h \in \mathbb{C} : \{x : h(x) = +1\} \subseteq \mathcal{X}_{i^*_t}, \mathrm{Parity}(|\{x : h(x) = +1\}|) = C_t\})$,
where the $i^*_t$ are independent across $t$, as are the $C_t$ and
$h^*_{t\theta_\star}$. Clearly the $\hat{p}_i$ from above can be viewed as an
estimator of this type, which simply ignores the knowledge of $i^*_t$. The
knowledge of these $i^*_t$ variables simplifies the analysis, since given
$\{i^*_t : t \le T\}$, the data can be partitioned into $\binom{m}{d}$ disjoint
sets, $\{\{Z^t_d(\theta_\star) : i^*_t = i\} : i = 1, \ldots, \binom{m}{d}\}$,
and we can use only the set $\{Z^t_d(\theta_\star) : i^*_t = i\}$ to estimate
$p_i$. Furthermore, we can use only the subset of these for which
$\mathbb{X}_{td} = \mathcal{X}_i$, since otherwise we have zero information about
the value of $\mathrm{Parity}(|\{x : h^*_{t\theta_\star}(x) = +1\}|)$. That is,
given $i^*_t = i$, any $Z^t_d(\theta_\star)$ is conditionally independent from
every $b_j$ for $j \neq i$, and is even conditionally independent from $b_i$ when
$\mathbb{X}_{td}$ is not completely contained in $\mathcal{X}_i$; specifically, in
this case, regardless of $b_i$, the conditional distribution of
$\mathbb{Y}_{td}(\theta_\star)$ given $i^*_t = i$ and given $\mathbb{X}_{td}$ is a
product distribution, which deterministically assigns label $-1$ to those
$Y_{tk}(\theta_\star)$ with $X_{tk} \notin \mathcal{X}_i$, and gives uniform
random values to the subset of $\mathbb{Y}_{td}(\theta_\star)$ with their
respective $X_{tk} \in \mathcal{X}_i$. Finally, letting
$r_t = \mathrm{Parity}(|\{k \le d : Y_{tk}(\theta_\star) = +1\}|)$, we note that
given $i^*_t = i$, $\mathbb{X}_{td} = \mathcal{X}_i$, and the value $r_t$, $b_i$
is conditionally independent from $Z^t_d(\theta_\star)$. Thus, the set of values
$C_{iT}(\theta_\star) = \{r_t : i^*_t = i, \mathbb{X}_{td} = \mathcal{X}_i\}$ is a
sufficient statistic for $b_i$ (hence for $p_i$). Recall that, when $i^*_t = i$
and $\mathbb{X}_{td} = \mathcal{X}_i$, the value of $r_t$ is equal to $C_t$, a
Bernoulli($p_i$) random variable. Thus, we neither lose nor gain anything (in
terms of risk) by restricting ourselves to estimators $\hat{q}_i$ of the type
$\hat{q}_i = \hat{q}_i(Z^1_d(\theta_\star), \ldots, Z^T_d(\theta_\star), i^*_1, \ldots, i^*_T) = \hat{q}'_i(C_{iT}(\theta_\star))$,
for some $\hat{q}'_i$ [8]: that is, estimators that are a function of the
$N_{iT}(\theta_\star) = |C_{iT}(\theta_\star)|$ Bernoulli($p_i$) random variables,
which we should note are conditionally i.i.d. given $N_{iT}(\theta_\star)$.

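Thus, by (1), for any $n \le T$,

$$\frac{1}{2} \sum_{b_i \in \{-1,+1\}} \mathbb{E}\left[ |\hat{q}_i - p_i| \,\middle|\, N_{iT}(\theta_\star) = n \right] = \frac{1}{2} \sum_{b_i \in \{-1,+1\}} \gamma_m \mathbb{P}\left( \hat{q}_i \neq p_i \,\middle|\, N_{iT}(\theta_\star) = n \right) \ge (\gamma_m/32) \cdot \exp\{-128\gamma_m^2 n/3\}.$$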


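Also note that, for each $i$, writing $N_i = N_{iT}(\theta_\star)$,
$\mathbb{E}[N_i] = \frac{d!(1/m)^d}{\binom{m}{d}} T \le (d/m)^{2d} T = d^{2d}(2\gamma_m/L)^{2d/\alpha} T$.
Thus, Jensen's inequality, linearity of expectation, and the law of total
expectation imply

$$\frac{1}{2} \sum_{b_i \in \{-1,+1\}} \mathbb{E}\left[ |\hat{q}_i - p_i| \right] \ge (\gamma_m/32) \cdot \exp\left\{ -43(2/L)^{2d/\alpha} d^{2d} \gamma_m^{2+2d/\alpha} T \right\}.$$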
bie{-i.+i} 
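Thus, by linearity of the expectation,

$$\left(\frac{1}{2}\right)^{\binom{m}{d}} \sum_{\mathbf{b} \in \{-1,+1\}^{\binom{m}{d}}} \mathbb{E}\left[ \sum_{i=1}^{\binom{m}{d}} \frac{1}{2^d \binom{m}{d}} |\hat{q}_i - p_i| \right] = \sum_{i=1}^{\binom{m}{d}} \frac{1}{2^d \binom{m}{d}} \cdot \frac{1}{2} \sum_{b_i \in \{-1,+1\}} \mathbb{E}\left[ |\hat{q}_i - p_i| \right] \ge \left( \gamma_m / (32 \cdot 2^d) \right) \cdot \exp\left\{ -43(2/L)^{2d/\alpha} d^{2d} \gamma_m^{2+2d/\alpha} T \right\}.$$

In particular, taking
$m = \left\lceil (L/2)^{1/\alpha} \left( 43(2/L)^{2d/\alpha} d^{2d} T \right)^{\frac{1}{2(d+\alpha)}} \right\rceil$,
we have
$\gamma_m = \Theta\left( \left( 43(2/L)^{2d/\alpha} d^{2d} T \right)^{-\frac{\alpha}{2(d+\alpha)}} \right)$,
so that

$$\left(\frac{1}{2}\right)^{\binom{m}{d}} \sum_{\mathbf{b} \in \{-1,+1\}^{\binom{m}{d}}} \mathbb{E}\left[ \sum_{i=1}^{\binom{m}{d}} \frac{1}{2^d \binom{m}{d}} |\hat{q}_i - p_i| \right] = \Omega\left( 2^{-d} \left( 43(2/L)^{2d/\alpha} d^{2d} T \right)^{-\frac{\alpha}{2(d+\alpha)}} \right).$$

In particular, this implies there exists some $\mathbf{b}$ for which

$$\mathbb{E}\left[ \sum_{i=1}^{\binom{m}{d}} \frac{1}{2^d \binom{m}{d}} |\hat{q}_i - p_i| \right] = \Omega\left( 2^{-d} \left( 43(2/L)^{2d/\alpha} d^{2d} T \right)^{-\frac{\alpha}{2(d+\alpha)}} \right).$$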
Applying this lower bound to the estimator pi above yields the result. 
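The choice of m in the last step is what keeps the exponential factor bounded below by a constant: with A standing for the product 43 (2/L)^(2d/alpha) d^(2d) T, the chosen m makes A times gamma_m^(2 + 2d/alpha) at most one. A small numerical sketch follows; the function name and the sample parameter values are hypothetical, and the formulas are taken as read from the proof above.

```python
from math import ceil

def lower_bound_parameters(T, d, L, alpha):
    """Return (m, gamma_m, exponent) for the construction in the proof sketch.

    exponent is the quantity inside exp{-...}; it should be at most 1
    (up to rounding), so the exponential factor is at least exp(-1).
    """
    A = 43 * (2 / L) ** (2 * d / alpha) * d ** (2 * d) * T
    m = ceil((L / 2) ** (1 / alpha) * A ** (1 / (2 * (d + alpha))))
    gamma_m = (L / 2) * (1 / m) ** alpha
    exponent = A * gamma_m ** (2 + 2 * d / alpha)
    return m, gamma_m, exponent

m, gamma_m, exponent = lower_bound_parameters(T=10**6, d=2, L=1.0, alpha=1.0)
```

Because gamma_m then scales as T^(-alpha / (2 (d + alpha))), the averaged risk inherits that rate, which is the exponent appearing in the theorem.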


It is natural to wonder how these rates might potentially improve if we allow
$\hat\theta_T$ to depend on more than $d$ samples per data set. To establish
limits on such improvements, we note that in the extreme case of allowing the
estimator to depend on the full data sets, we may recover the known results lower
bounding the risk of density estimation from i.i.d. samples from a smooth density,
as indicated by the following result.

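Theorem 3. For any integer $d \ge 1$, there exists an instance space
$\mathcal{X}$, a concept space $\mathbb{C}$ of VC dimension $d$, a distribution
$\mathcal{D}$ over $\mathcal{X}$, and a distribution $\pi_0$ over $\mathbb{C}$
such that, for $\Pi_\Theta$ the set of distributions over $\mathbb{C}$ with
$(L,\alpha)$-Hölder smooth density functions with respect to $\pi_0$, any
sequence of estimators,
$\hat\theta_T = \hat\theta_T(Z^1(\theta_\star), \ldots, Z^T(\theta_\star))$
$(T = 1, 2, \ldots)$, has

$$\sup_{\theta_\star \in \Theta} \mathbb{E}\left[ \|\pi_{\hat\theta_T} - \pi_{\theta_\star}\| \right] = \Omega\left( T^{-\frac{\alpha}{d+2\alpha}} \right).$$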


The proof is a simple reduction from the problem of estimating $\pi_{\theta_\star}$
based on direct access to $h^*_{1\theta_\star}, \ldots, h^*_{T\theta_\star}$, which
is essentially equivalent to the standard model of density estimation, and indeed
the lower bound in Theorem 3 is a well-known result for density estimation from
$T$ i.i.d. samples from a Hölder smooth density in a $d$-dimensional space [5].
5 Real-Valued Functions and an Application in
Algorithmic Economics

In this section, we present results generalizing the analysis of [12] to classes of 
real-valued functions. We also present an application of this generalization to a 
preference elicitation problem. 

5.1 Consistent Estimation of Priors over Real-Valued Functions at 
a Bounded Rate 

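In this section, we let $\mathcal{B}$ denote a $\sigma$-algebra on
$\mathcal{X} \times \mathbb{R}$, and again let $\mathcal{B}_{\mathcal{X}}$ denote
the corresponding $\sigma$-algebra on $\mathcal{X}$. Also, for measurable
functions $h, g : \mathcal{X} \to \mathbb{R}$, let
$\rho(h,g) = \int |h - g| \, dP_X$, where $P_X$ is a distribution over
$\mathcal{X}$. Let $\mathcal{F}$ be a class of functions
$\mathcal{X} \to \mathbb{R}$ with Borel $\sigma$-algebra
$\mathcal{B}_{\mathcal{F}}$ induced by $\rho$. Let $\Theta$ be a set, and for each
$\theta \in \Theta$, let $\pi_\theta$ denote a probability measure on
$(\mathcal{F}, \mathcal{B}_{\mathcal{F}})$. We suppose
$\{\pi_\theta : \theta \in \Theta\}$ is totally bounded in total variation
distance, and that $\mathcal{F}$ is a uniformly bounded VC subgraph class with
pseudodimension $d$. We also suppose $\rho$ is a metric when restricted to
$\mathcal{F}$.

As above, let $\{X_{ti}\}_{t,i \in \mathbb{N}}$ be i.i.d. $P_X$ random variables.
For each $\theta \in \Theta$, let $\{h^*_{t\theta}\}_{t \in \mathbb{N}}$ be i.i.d.
$\pi_\theta$ random variables, independent from $\{X_{ti}\}_{t,i \in \mathbb{N}}$.
For each $t \in \mathbb{N}$ and $\theta \in \Theta$, let
$Y_{ti}(\theta) = h^*_{t\theta}(X_{ti})$ for $i \in \mathbb{N}$, and let
$Z^t(\theta) = \{(X_{t1}, Y_{t1}(\theta)), (X_{t2}, Y_{t2}(\theta)), \ldots\}$;
for each $k \in \mathbb{N}$, define
$Z^t_k(\theta) = \{(X_{t1}, Y_{t1}(\theta)), \ldots, (X_{tk}, Y_{tk}(\theta))\}$,
$\mathbb{X}_{tk} = \{X_{t1}, \ldots, X_{tk}\}$, and
$\mathbb{Y}_{tk}(\theta) = \{Y_{t1}(\theta), \ldots, Y_{tk}(\theta)\}$.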

We have the following result. The proof parallels that of [12] (who studied
the special case of binary functions), with a few important twists (in particular,
a significantly different approach in the analogue of their Lemma 3). The details
are included in Appendix A.

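Theorem 4. There exists an estimator
$\hat\theta_{T\theta_\star} = \hat\theta_T(Z^1_d(\theta_\star), \ldots, Z^T_d(\theta_\star))$,
and functions $R : \mathbb{N}_0 \times (0,1] \to [0,\infty)$ and
$\delta : \mathbb{N}_0 \times (0,1] \to [0,1]$ such that, for any $\alpha > 0$,
$\lim_{T \to \infty} R(T,\alpha) = \lim_{T \to \infty} \delta(T,\alpha) = 0$ and
for any $T \in \mathbb{N}_0$ and $\theta_\star \in \Theta$,

$$\mathbb{P}\left( \|\pi_{\hat\theta_{T\theta_\star}} - \pi_{\theta_\star}\| > R(T,\alpha) \right) \le \delta(T,\alpha) \le \alpha.$$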


5.2 Maximizing Customer Satisfaction in Combinatorial Auctions 

Theorem 4 has a clear application in the context of transfer learning, following
analogous arguments to those given in the special case of binary classification by
[12]. In addition to that application, we can also use Theorem 4 in the context of
the following problem in algorithmic economics, where the objective is to serve
a sequence of customers so as to maximize their satisfaction.

Consider an online travel agency, where customers go to the site with some 
idea of what type of travel they are interested in; the site then poses a series 
of questions to each customer, and identifies a travel package that best suits 
their desires, budget, and dates. There are many options of travel packages, with
options on location, sightseeing tours, hotel and room quality, etc. Because of this,
serving the needs of an arbitrary customer might be a lengthy process, requiring 
many detailed questions. Fortunately, the stream of customers is typically not 
a worst-case sequence, and in particular obeys many statistical regularities: in 
particular, it is not too far from reality to think of the customers as being 
independent and identically distributed samples. With this assumption in mind, 
it becomes desirable to identify some of these statistical regularities so that we 
can pose the questions that are typically most relevant, and thereby more quickly 
identify the travel package that best suits the needs of the typical customer. One 
straightforward way to do this is to directly estimate the distribution of customer 
value functions, and optimize the questioning system to minimize the expected 
number of questions needed to find a suitable travel package. 

One can model this problem in the style of Bayesian combinatorial auctions, 
in which each customer has a value function for each possible bundle of items. 
However, it is slightly different, in that we do not assume the distribution of 
customers is known, but rather are interested in estimating this distribution; 
the obtained estimate can then be used in combination with methods based 
on Bayesian decision theory. In contrast to the literature on Bayesian auctions
(and subjectivist Bayesian decision theory in general), this technique is able to
maintain general guarantees on performance that hold under an objective
interpretation of the problem, rather than merely guarantees holding under an
arbitrary assumed prior belief. This general idea is sometimes referred to as
Empirical Bayesian decision theory in the machine learning and statistics
literatures. The ideal result for an Empirical Bayesian algorithm is to be competitive
with the corresponding Bayesian methods based on the actual distribution of 
the data (assuming the data are random, with an unknown distribution); that 
is, although the Empirical Bayesian methods only operate with a data-based 
estimate of the distribution, the aim is to perform nearly as well as methods 
based on the true (unobservable) distribution. In this work, we present results 
of this type, in the context of an abstraction of the aforementioned online travel 
agency problem, where the measure of performance is the expected number of 
questions to find a suitable package. 

The specific application we are interested in here may be expressed abstractly
as a kind of combinatorial auction with preference elicitation. Specifically, we
suppose there is a collection of items on a menu, and each possible bundle of
items has an associated fixed price. There is a stream of customers, each with a
valuation function that provides a value for each possible bundle of items. The
objective is to serve each customer a bundle of items that nearly maximizes his
or her surplus value (value minus price). However, we are not permitted direct
observation of the customer valuation functions; rather, we may query for the
value of any given bundle of items; this is referred to as a value query in the
literature on preference elicitation in combinatorial auctions (see Chapter 14
of [3], [13]). The objective is to achieve this near-maximal surplus guarantee,
while making only a small number of queries per customer. We suppose the
customer valuation functions are sampled i.i.d. according to an unknown
distribution over a known (but arbitrary) class of real-valued functions having finite
pseudo-dimension. Reasoning that knowledge of this distribution should allow
one to make a smaller number of value queries per customer, we are interested
in estimating this unknown distribution, so that as we serve more and more
customers, the number of queries per customer required to identify a near-optimal
bundle should decrease. In this context, we in fact prove that in the limit, the
expected number of queries per customer converges to the number required of a
method having direct knowledge of the true distribution of valuation functions.

Formally, suppose there is a menu of $n$ items $[n] = \{1, \ldots, n\}$, and each
bundle $B \subseteq [n]$ has an associated price $p(B) \ge 0$. Suppose also there
is a sequence of customers, each with a valuation function
$v_t : 2^{[n]} \to \mathbb{R}$. We suppose these $v_t$ functions are i.i.d.
samples. We can then calculate the satisfaction function for each customer as
$s_t(x)$, where $x \in \{0,1\}^n$, and $s_t(x) = v_t(B_x) - p(B_x)$, where
$B_x \subseteq [n]$ contains element $i \in [n]$ iff $x_i = 1$.
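The satisfaction function is simple to write down once bundles are encoded as binary vectors. The sketch below defines it and finds a maximizing bundle by exhaustive search over all binary vectors of length n, which is the worst-case baseline that the querying methods discussed next try to beat. The additive valuation, the per-item price, and all function names are hypothetical.

```python
from itertools import product

def bundle(x):
    """Items (numbered from 1) selected by the binary vector x."""
    return frozenset(i + 1 for i, xi in enumerate(x) if xi == 1)

def satisfaction(v, p, x):
    """Surplus of the bundle encoded by x: value minus price."""
    B = bundle(x)
    return v(B) - p(B)

def best_bundle(v, p, n):
    """Exhaustive search over all 2**n bundles (one value query per bundle)."""
    return max(product((0, 1), repeat=n), key=lambda x: satisfaction(v, p, x))

# Hypothetical customer with additive item values and a flat price per item.
item_values = {1: 3.0, 2: 1.0, 3: 2.0}
v = lambda B: sum(item_values[i] for i in B)
p = lambda B: 1.5 * len(B)
x_best = best_bundle(v, p, n=3)
s_best = satisfaction(v, p, x_best)
```

Here items 1 and 3 are worth more than their price and item 2 is not, so the best bundle is the one containing items 1 and 3.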

Now suppose we are able to ask each customer a number of questions before
serving up a bundle to that customer. More specifically, we are able to ask
for the value $s_t(x)$ for any $x \in \{0,1\}^n$. This is referred to as a value
query in the literature on preference elicitation in combinatorial auctions (see
Chapter 14 of [3], [13]). We are interested in asking as few questions as possible,
while satisfying the guarantee that the selected $\hat{x}_t$ has
$\mathbb{E}[\max_x s_t(x) - s_t(\hat{x}_t)] \le \varepsilon$.

Now suppose, for every $\pi$ and $\varepsilon$, we have a method $A(\pi,\varepsilon)$ such that, given
that $\pi$ is the actual distribution of the $s_t$ functions, $A(\pi,\varepsilon)$ guarantees that
the $\hat{x}_t$ value it selects has $\mathbb{E}[\max_x s_t(x) - s_t(\hat{x}_t)] \le \varepsilon$; also let $\hat{N}_t(\pi,\varepsilon)$ denote
the actual (random) number of queries the method $A(\pi,\varepsilon)$ would ask for the $s_t$
function, and let $Q(\pi,\varepsilon) = \mathbb{E}[\hat{N}_1(\pi,\varepsilon)]$. We suppose the method never queries
any $s_t(x)$ value twice for a given $t$, so that its number of queries for any given $t$
is bounded.

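Also suppose $\mathcal{F}$ is a VC subgraph class of functions mapping $\mathcal{X} = \{0,1\}^n$
into $[-1,1]$ with pseudodimension $d$, and that $\{\pi_\theta : \theta \in \Theta\}$ is a known totally
bounded family of distributions over $\mathcal{F}$ such that the $s_t$ functions have
distribution $\pi_{\theta_\star}$ for some unknown $\theta_\star \in \Theta$. For any $\theta \in \Theta$ and $\gamma > 0$, let
$B(\theta,\gamma) = \{\theta' \in \Theta : \|\pi_\theta - \pi_{\theta'}\| \le \gamma\}$.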
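
As a rough illustration of the ball $B(\theta,\gamma)$, the sketch below computes it for a finite family of discrete distributions, taking the norm to be the total variation distance $\sup_A |\pi_\theta(A) - \pi_{\theta'}(A)|$. The dictionary representation and the names `tv_distance` and `ball` are assumptions for this example; the family in the text need not be finite or discrete.

```python
def tv_distance(p, q):
    """Total variation distance sup_A |p(A) - q(A)| between two discrete
    distributions given as dicts mapping outcomes to probabilities."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(z, 0.0) - q.get(z, 0.0)) for z in support)

def ball(family, theta, gamma):
    """B(theta, gamma): parameters theta' with ||pi_theta - pi_theta'|| <= gamma."""
    return [th for th in family
            if tv_distance(family[theta], family[th]) <= gamma]

# Hypothetical three-element family over two outcomes.
family = {
    "a": {0: 0.5, 1: 0.5},
    "b": {0: 0.6, 1: 0.4},
    "c": {0: 0.9, 1: 0.1},
}
near_a = ball(family, "a", 0.15)  # "c" is at distance 0.4 and is excluded
```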

Suppose, in addition to $A$, we have another method $A'(\varepsilon)$ that is not $\pi$-dependent,
but still provides the $\varepsilon$-correctness guarantee, and makes a bounded
number of queries (e.g., in the worst case, we could consider querying all $2^n$
points, but in most cases there are more clever $\pi$-independent methods that use
far fewer queries, such as $O(1/\varepsilon^2)$). Consider the following method; the quantities
$\hat{\theta}_{T\theta_\star}$, $R(T,\alpha)$, and $\delta(T,\alpha)$ from Theorem 3 are here considered with respect to $P_X$
taken as the uniform distribution on $\{0,1\}^n$.

The following theorem indicates that this method is correct, and furthermore
that the long-run average number of queries is not much worse than that of a
method that has direct knowledge of $\pi_{\theta_\star}$. The proof of this result parallels that
of [12] for the transfer learning setting, but is included here for completeness.
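
Algorithm 1 An algorithm for sequentially maximizing expected customer satisfaction.

for $t = 1, 2, \ldots, T$ do
    Pick points $X_{t1}, X_{t2}, \ldots, X_{td}$ uniformly at random from $\{0,1\}^n$
    if $R(t-1, \varepsilon/2) > \varepsilon/8$ then
        Run $A'(\varepsilon)$
        Take $\hat{x}_t$ as the returned value
    else
        Let $\check{\theta}_{t\theta_\star} \in B(\hat{\theta}_{(t-1)\theta_\star}, R(t-1,\varepsilon/2))$ be such that
            $Q(\pi_{\check{\theta}_{t\theta_\star}}, \varepsilon/4) \le \min_{\theta \in B(\hat{\theta}_{(t-1)\theta_\star}, R(t-1,\varepsilon/2))} Q(\pi_\theta, \varepsilon/4) + \frac{1}{t}$
        Run $A(\pi_{\check{\theta}_{t\theta_\star}}, \varepsilon/4)$ and let $\hat{x}_t$ be its return value
    end if
end for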
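
The control flow of Algorithm 1 can be sketched as follows. This is only an illustration under stated assumptions: every callable argument (`R`, `query`, `estimate_theta`, `ball`, `Q`, `run_A`, `run_A_prime`) is a hypothetical stand-in supplied by the caller, the candidate set returned by `ball` is assumed finite, and the sketch assumes that the $d$ random points are queried so that their values can feed the estimator.

```python
import random

def run_algorithm_1(T, n, d, eps, R, query, estimate_theta, ball, Q,
                    run_A, run_A_prime):
    """Sketch of Algorithm 1. All callables are hypothetical stand-ins:
    R(t, alpha)          -> confidence radius of the prior estimate
    query(t, x)          -> value s_t(x) reported by customer t
    estimate_theta(hist) -> estimate of theta* from past customers' data
    ball(theta, radius)  -> finite list of parameters within radius of theta
    Q(theta, eps)        -> expected query count of A(pi_theta, eps)
    run_A(theta, eps, t) -> point chosen by the prior-dependent method
    run_A_prime(eps, t)  -> point chosen by the prior-independent method
    """
    chosen, history = [], []
    for t in range(1, T + 1):
        # d uniformly random points of {0,1}^n for this customer.
        points = [tuple(random.randint(0, 1) for _ in range(n))
                  for _ in range(d)]
        radius = R(t - 1, eps / 2)
        if radius > eps / 8:
            # Estimate not yet reliable: fall back to the prior-free method.
            x_hat = run_A_prime(eps, t)
        else:
            theta_hat = estimate_theta(history)  # uses customers 1..t-1 only
            candidates = ball(theta_hat, radius)
            # Exact minimiser over a finite list; the algorithm only
            # requires a minimiser up to an additive 1/t.
            theta_check = min(candidates, key=lambda th: Q(th, eps / 4))
            x_hat = run_A(theta_check, eps / 4, t)
        chosen.append(x_hat)
        # Assumption: the d random points are queried to feed the estimator.
        history.append([(x, query(t, x)) for x in points])
    return chosen
```

Choosing the candidate with the smallest expected query count inside the confidence ball is what lets the long-run query rate approach that of a method knowing the true prior, since the true parameter lies in the ball with high probability.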


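Theorem 5. For the above method, $\forall t \le T$, $\mathbb{E}[\max_x s_t(x) - s_t(\hat{x}_t)] \le \varepsilon$. Furthermore,
if $S_T(\varepsilon)$ is the total number of queries made by the method, then

\[ \limsup_{T\to\infty} \frac{\mathbb{E}[S_T(\varepsilon)]}{T} \le Q(\pi_{\theta_\star}, \varepsilon/4) + d. \]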

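Proof. By Theorem 3, for any $t \le T$, if $R(t-1,\varepsilon/2) \le \varepsilon/8$, then with probability
at least $1-\varepsilon/2$, $\|\pi_{\theta_\star} - \pi_{\hat{\theta}_{(t-1)\theta_\star}}\| \le R(t-1,\varepsilon/2)$, so that a triangle inequality
implies $\|\pi_{\theta_\star} - \pi_{\check{\theta}_{t\theta_\star}}\| \le 2R(t-1,\varepsilon/2) \le \varepsilon/4$. Thus,

\[ \mathbb{E}\Big[\max_x s_t(x) - s_t(\hat{x}_t)\Big] \le \varepsilon/2 + \mathbb{E}\Big[ \mathbb{E}\Big[\max_x s_t(x) - s_t(\hat{x}_t) \,\Big|\, \check{\theta}_{t\theta_\star}\Big] \mathbb{1}\big[\|\pi_{\check{\theta}_{t\theta_\star}} - \pi_{\theta_\star}\| \le \varepsilon/4\big] \Big]. \]

For $\theta \in \Theta$, let $\hat{x}_{t\theta}$ denote the point $x$ that would be returned by $A(\pi_{\check{\theta}_{t\theta_\star}}, \varepsilon/4)$
when queries are answered by some $s_{t\theta} \sim \pi_\theta$ instead of $s_t$ (and supposing $s_t = s_{t\theta_\star}$).
If $\|\pi_{\check{\theta}_{t\theta_\star}} - \pi_{\theta_\star}\| \le \varepsilon/4$, then

\[
\begin{aligned}
\mathbb{E}\Big[\max_x s_t(x) - s_t(\hat{x}_t) \,\Big|\, \check{\theta}_{t\theta_\star}\Big]
&= \mathbb{E}\Big[\max_x s_{t\theta_\star}(x) - s_{t\theta_\star}(\hat{x}_{t\theta_\star}) \,\Big|\, \check{\theta}_{t\theta_\star}\Big] \\
&\le \mathbb{E}\Big[\max_x s_{t\check{\theta}_{t\theta_\star}}(x) - s_{t\check{\theta}_{t\theta_\star}}(\hat{x}_{t\check{\theta}_{t\theta_\star}}) \,\Big|\, \check{\theta}_{t\theta_\star}\Big] + \|\pi_{\check{\theta}_{t\theta_\star}} - \pi_{\theta_\star}\| \\
&\le \varepsilon/4 + \varepsilon/4 = \varepsilon/2.
\end{aligned}
\]

Plugging into the above bound, we have $\mathbb{E}[\max_x s_t(x) - s_t(\hat{x}_t)] \le \varepsilon$.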

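For the result on $S_T(\varepsilon)$, first note that $R(t-1,\varepsilon/2) > \varepsilon/8$ only finitely many
times (due to $R(t,\alpha) = o(1)$), so that we can ignore those values of $t$ in the
asymptotic calculation (as the number of queries is always bounded), and rely
on the correctness guarantee of $A'$. For the remaining values $t$, let $N_t$ denote the
number of queries made by $A(\pi_{\check{\theta}_{t\theta_\star}}, \varepsilon/4)$. Then

\[ \limsup_{T\to\infty} \frac{\mathbb{E}[S_T(\varepsilon)]}{T} \le d + \limsup_{T\to\infty} \sum_{t=1}^{T} \frac{\mathbb{E}[N_t]}{T}. \]

Since

\[
\lim_{T\to\infty} \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}\Big[ N_t \mathbb{1}\big[\|\pi_{\hat{\theta}_{(t-1)\theta_\star}} - \pi_{\theta_\star}\| > R(t-1,\varepsilon/2)\big] \Big]
\le 2^n \lim_{T\to\infty} \frac{1}{T} \sum_{t=1}^{T} \delta(t-1,\varepsilon/2) = 0,
\]

we have

\[
\limsup_{T\to\infty} \sum_{t=1}^{T} \frac{\mathbb{E}[N_t]}{T}
= \limsup_{T\to\infty} \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}\Big[ N_t \mathbb{1}\big[\|\pi_{\hat{\theta}_{(t-1)\theta_\star}} - \pi_{\theta_\star}\| \le R(t-1,\varepsilon/2)\big] \Big].
\]

For $t \le T$, let $N_t(\check{\theta}_{t\theta_\star})$ denote the number of queries $A(\pi_{\check{\theta}_{t\theta_\star}}, \varepsilon/4)$ would make if
queries were answered with $s_{t\check{\theta}_{t\theta_\star}}$ instead of $s_t$. On the event
$\|\pi_{\hat{\theta}_{(t-1)\theta_\star}} - \pi_{\theta_\star}\| \le R(t-1,\varepsilon/2)$, we have

\[
\begin{aligned}
\mathbb{E}\big[ N_t \,\big|\, \check{\theta}_{t\theta_\star} \big]
&\le \mathbb{E}\big[ N_t(\check{\theta}_{t\theta_\star}) \,\big|\, \check{\theta}_{t\theta_\star} \big] + 2R(t-1,\varepsilon/2) \\
&= Q(\pi_{\check{\theta}_{t\theta_\star}}, \varepsilon/4) + 2R(t-1,\varepsilon/2)
\le Q(\pi_{\theta_\star}, \varepsilon/4) + 2R(t-1,\varepsilon/2) + 1/t.
\end{aligned}
\]

Therefore,

\[
\begin{aligned}
\limsup_{T\to\infty} \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}\Big[ N_t \mathbb{1}\big[\|\pi_{\hat{\theta}_{(t-1)\theta_\star}} - \pi_{\theta_\star}\| \le R(t-1,\varepsilon/2)\big] \Big]
&\le Q(\pi_{\theta_\star}, \varepsilon/4) + \limsup_{T\to\infty} \frac{1}{T} \sum_{t=1}^{T} \big( 2R(t-1,\varepsilon/2) + 1/t \big) \\
&= Q(\pi_{\theta_\star}, \varepsilon/4).
\end{aligned}
\]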


t=i 


In many cases, this result will even continue to hold with an infinite number
of goods ($n = \infty$), since Theorem 3 has no dependence on the cardinality of the
space $\mathcal{X}$.


6 Open Problems 

There are several interesting questions that remain open at this time. Can either 
the lower bound or upper bound be improved in general? If, instead of $d$ samples
per task, we use $m > d$ samples, how does the minimax risk vary with
$m$? Related to this, what is the optimal value of $m$ to optimize the rate of
convergence as a function of $mT$, the total number of samples? More generally,
if an estimator is permitted to use $N$ total samples, taken from however many
tasks it wishes, what is the optimal rate of convergence as a function of $N$?

A Proofs for Section 5

The proof of Theorem 4 is based on the following sequence of lemmas, which
parallel those used by [12] for establishing the analogous result for consistent
estimation of priors over binary functions. The last of these lemmas (namely,
Lemma 3) requires substantial modifications to the original argument of [12];
the others use arguments more directly based on those of [12].

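Lemma 1. For any $\theta,\theta' \in \Theta$ and $t \in \mathbb{N}$,

\[ \|\pi_\theta - \pi_{\theta'}\| = \|\mathbb{P}_{Z_t(\theta)} - \mathbb{P}_{Z_t(\theta')}\|. \]

Proof. Fix $\theta,\theta' \in \Theta$, $t \in \mathbb{N}$. Let $\mathbb{X} = \{X_{t1}, X_{t2}, \ldots\}$, $\mathbb{Y}(\theta) = \{Y_{t1}(\theta), Y_{t2}(\theta), \ldots\}$,
and for $k \in \mathbb{N}$ let $\mathbb{X}_k = \{X_{t1},\ldots,X_{tk}\}$ and $\mathbb{Y}_k(\theta) = \{Y_{t1}(\theta),\ldots,Y_{tk}(\theta)\}$. For
$h \in \mathcal{F}$, let $c_{\mathbb{X}}(h) = \{(X_{t1}, h(X_{t1})), (X_{t2}, h(X_{t2})), \ldots\}$.

For $h,g \in \mathcal{F}$, define $\rho_{\mathbb{X}}(h,g) = \lim_{m\to\infty} \frac{1}{m}\sum_{i=1}^{m} |h(X_{ti}) - g(X_{ti})|$ (if the limit
exists), and $\rho_{\mathbb{X}_k}(h,g) = \frac{1}{k}\sum_{i=1}^{k} |h(X_{ti}) - g(X_{ti})|$. Note that since $\mathcal{F}$ is a uniformly
bounded VC subgraph class, so is the collection of functions $\{|h-g| : h,g \in \mathcal{F}\}$,
so that the uniform strong law of large numbers implies that with probability
one, $\forall h,g \in \mathcal{F}$, $\rho_{\mathbb{X}}(h,g)$ exists and has $\rho_{\mathbb{X}}(h,g) = \rho(h,g)$ [10].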
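
The empirical distance $\rho_{\mathbb{X}_k}$ is straightforward to compute, and a small simulation shows the convergence for a single pair of functions. The two functions and the uniform sampling distribution on $[0,1]$ below are assumptions chosen only so that the limiting value is easy to verify by hand; the sketch does not illustrate the uniformity over the whole class that the argument relies on.

```python
import random

def empirical_distance(h, g, points):
    """rho_{X_k}(h, g): average of |h(x) - g(x)| over the sample points."""
    return sum(abs(h(x) - g(x)) for x in points) / len(points)

# Hypothetical pair of functions on [0, 1], with P_X uniform on [0, 1].
h = lambda x: x
g = lambda x: x * x

rng = random.Random(0)
sample = [rng.random() for _ in range(100_000)]

# The limiting distance is the integral of (x - x^2) over [0, 1], i.e. 1/6.
estimate = empirical_distance(h, g, sample)
```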

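Consider any $\theta,\theta' \in \Theta$, and any $A \in \mathcal{B}_{\mathcal{F}}$. Then any $h \notin A$ has $\forall g \in A$,
$\rho(h,g) > 0$ (by the metric assumption). Thus, if $\rho_{\mathbb{X}}(h,g) = \rho(h,g)$ for all $h,g \in \mathcal{F}$,
then $\forall h \notin A$,

\[ \forall g \in A,\ \rho_{\mathbb{X}}(h,g) = \rho(h,g) > 0 \implies \forall g \in A,\ c_{\mathbb{X}}(h) \ne c_{\mathbb{X}}(g) \implies c_{\mathbb{X}}(h) \notin c_{\mathbb{X}}(A). \]

This implies $c_{\mathbb{X}}^{-1}(c_{\mathbb{X}}(A)) = A$. Under these conditions,

\[ \mathbb{P}_{Z_t(\theta)|\mathbb{X}}(c_{\mathbb{X}}(A)) = \pi_\theta(c_{\mathbb{X}}^{-1}(c_{\mathbb{X}}(A))) = \pi_\theta(A), \]

and similarly for $\theta'$.

Any measurable set $C$ for the range of $Z_t(\theta)$ can be expressed as $C = \{c_{\bar{x}}(h) : (h,\bar{x}) \in C'\}$
for some appropriate $C' \in \mathcal{B}_{\mathcal{F}} \otimes \mathcal{B}_{\mathcal{X}}^{\infty}$. Letting $C'_{\bar{x}} = \{h : (h,\bar{x}) \in C'\}$,
we have

\[ \mathbb{P}_{Z_t(\theta)}(C) = \int \pi_\theta\big(c_{\bar{x}}^{-1}(c_{\bar{x}}(C'_{\bar{x}}))\big) P_X(d\bar{x}) = \int \pi_\theta(C'_{\bar{x}}) P_X(d\bar{x}) = \mathbb{P}_{(h^*_{t\theta},\mathbb{X})}(C'). \]

Likewise, this reasoning holds for $\theta'$. Then

\[
\begin{aligned}
\|\mathbb{P}_{Z_t(\theta)} - \mathbb{P}_{Z_t(\theta')}\|
&= \|\mathbb{P}_{(h^*_{t\theta},\mathbb{X})} - \mathbb{P}_{(h^*_{t\theta'},\mathbb{X})}\| \\
&= \sup_{C'} \left| \int \big(\pi_\theta(C'_{\bar{x}}) - \pi_{\theta'}(C'_{\bar{x}})\big) P_X(d\bar{x}) \right| \\
&\le \int \sup_{A \in \mathcal{B}_{\mathcal{F}}} |\pi_\theta(A) - \pi_{\theta'}(A)| \, P_X(d\bar{x}) = \|\pi_\theta - \pi_{\theta'}\|.
\end{aligned}
\]

Since $h^*_{t\theta}$ and $\mathbb{X}$ are independent, $\forall A \in \mathcal{B}_{\mathcal{F}}$, $\pi_\theta(A) = \mathbb{P}_{h^*_{t\theta}}(A) = \mathbb{P}_{h^*_{t\theta}}(A) P_X(\mathcal{X}^\infty)
= \mathbb{P}_{(h^*_{t\theta},\mathbb{X})}(A \times \mathcal{X}^\infty)$. Analogous reasoning holds for $h^*_{t\theta'}$. Thus, we have

\[
\begin{aligned}
\|\pi_\theta - \pi_{\theta'}\|
&= \|\mathbb{P}_{(h^*_{t\theta},\mathbb{X})}(\cdot \times \mathcal{X}^\infty) - \mathbb{P}_{(h^*_{t\theta'},\mathbb{X})}(\cdot \times \mathcal{X}^\infty)\| \\
&\le \|\mathbb{P}_{(h^*_{t\theta},\mathbb{X})} - \mathbb{P}_{(h^*_{t\theta'},\mathbb{X})}\| = \|\mathbb{P}_{Z_t(\theta)} - \mathbb{P}_{Z_t(\theta')}\|.
\end{aligned}
\]

Altogether, we have $\|\mathbb{P}_{Z_t(\theta)} - \mathbb{P}_{Z_t(\theta')}\| = \|\pi_\theta - \pi_{\theta'}\|$. □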

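Lemma 2. There exists a sequence $r_k = o(1)$ such that, $\forall t,k \in \mathbb{N}$, $\forall \theta,\theta' \in \Theta$,

\[ \|\mathbb{P}_{Z_{tk}(\theta)} - \mathbb{P}_{Z_{tk}(\theta')}\| \le \|\pi_\theta - \pi_{\theta'}\| \le \|\mathbb{P}_{Z_{tk}(\theta)} - \mathbb{P}_{Z_{tk}(\theta')}\| + r_k. \]

Proof. This proof follows identically to a proof of [12], but is included here
for completeness. Since $\mathbb{P}_{Z_{tk}(\theta)}(A) = \mathbb{P}_{Z_t(\theta)}(A \times (\mathcal{X} \times \mathbb{R})^\infty)$ for all measurable
$A \subseteq (\mathcal{X} \times \mathbb{R})^k$, and similarly for $\theta'$, we have

\[
\begin{aligned}
\|\mathbb{P}_{Z_{tk}(\theta)} - \mathbb{P}_{Z_{tk}(\theta')}\|
&= \sup_{A \in \mathcal{B}^k} \mathbb{P}_{Z_{tk}(\theta)}(A) - \mathbb{P}_{Z_{tk}(\theta')}(A) \\
&= \sup_{A \in \mathcal{B}^k} \mathbb{P}_{Z_t(\theta)}(A \times (\mathcal{X} \times \mathbb{R})^\infty) - \mathbb{P}_{Z_t(\theta')}(A \times (\mathcal{X} \times \mathbb{R})^\infty) \\
&\le \sup_{A \in \mathcal{B}^\infty} \mathbb{P}_{Z_t(\theta)}(A) - \mathbb{P}_{Z_t(\theta')}(A) = \|\mathbb{P}_{Z_t(\theta)} - \mathbb{P}_{Z_t(\theta')}\|,
\end{aligned}
\]

which implies the left inequality when combined with Lemma 1.

Next, we focus on the right inequality. Fix $\theta,\theta' \in \Theta$ and $\gamma > 0$, and let
$B \in \mathcal{B}^\infty$ be such that

\[ \|\pi_\theta - \pi_{\theta'}\| = \|\mathbb{P}_{Z_t(\theta)} - \mathbb{P}_{Z_t(\theta')}\| < \mathbb{P}_{Z_t(\theta)}(B) - \mathbb{P}_{Z_t(\theta')}(B) + \gamma. \]

Let $\mathcal{A} = \{A \times (\mathcal{X} \times \mathbb{R})^\infty : A \in \mathcal{B}^k, k \in \mathbb{N}\}$. Note that $\mathcal{A}$ is an algebra that
generates $\mathcal{B}^\infty$. Thus, Carathéodory's extension theorem (specifically, the version
presented by [8]) implies that there exist disjoint sets $\{A_i\}_{i\in\mathbb{N}}$ in $\mathcal{A}$ such that
$B \subseteq \bigcup_{i\in\mathbb{N}} A_i$ and

\[ \mathbb{P}_{Z_t(\theta)}(B) - \mathbb{P}_{Z_t(\theta')}(B) < \sum_{i\in\mathbb{N}} \mathbb{P}_{Z_t(\theta)}(A_i) - \sum_{i\in\mathbb{N}} \mathbb{P}_{Z_t(\theta')}(A_i) + \gamma. \]

Since these $A_i$ sets are disjoint, each of these sums is bounded by a probability
value, which implies that there exists some $n \in \mathbb{N}$ such that

\[ \sum_{i\in\mathbb{N}} \mathbb{P}_{Z_t(\theta)}(A_i) < \gamma + \sum_{i=1}^{n} \mathbb{P}_{Z_t(\theta)}(A_i), \]

which implies

\[
\begin{aligned}
\sum_{i\in\mathbb{N}} \mathbb{P}_{Z_t(\theta)}(A_i) - \sum_{i\in\mathbb{N}} \mathbb{P}_{Z_t(\theta')}(A_i)
&< \gamma + \sum_{i=1}^{n} \mathbb{P}_{Z_t(\theta)}(A_i) - \sum_{i=1}^{n} \mathbb{P}_{Z_t(\theta')}(A_i) \\
&= \gamma + \mathbb{P}_{Z_t(\theta)}\Big(\bigcup_{i=1}^{n} A_i\Big) - \mathbb{P}_{Z_t(\theta')}\Big(\bigcup_{i=1}^{n} A_i\Big).
\end{aligned}
\]

As $\bigcup_{i=1}^{n} A_i \in \mathcal{A}$, there exists $m \in \mathbb{N}$ and measurable $B_m \in \mathcal{B}^m$ such that
$\bigcup_{i=1}^{n} A_i = B_m \times (\mathcal{X} \times \mathbb{R})^\infty$, and therefore

\[
\begin{aligned}
\mathbb{P}_{Z_t(\theta)}\Big(\bigcup_{i=1}^{n} A_i\Big) - \mathbb{P}_{Z_t(\theta')}\Big(\bigcup_{i=1}^{n} A_i\Big)
&= \mathbb{P}_{Z_{tm}(\theta)}(B_m) - \mathbb{P}_{Z_{tm}(\theta')}(B_m) \\
&\le \|\mathbb{P}_{Z_{tm}(\theta)} - \mathbb{P}_{Z_{tm}(\theta')}\| \le \lim_{k\to\infty} \|\mathbb{P}_{Z_{tk}(\theta)} - \mathbb{P}_{Z_{tk}(\theta')}\|.
\end{aligned}
\]

Combining the above, we have $\|\pi_\theta - \pi_{\theta'}\| \le \lim_{k\to\infty} \|\mathbb{P}_{Z_{tk}(\theta)} - \mathbb{P}_{Z_{tk}(\theta')}\| + 3\gamma$. By
letting $\gamma$ approach $0$, we have

\[ \|\pi_\theta - \pi_{\theta'}\| \le \lim_{k\to\infty} \|\mathbb{P}_{Z_{tk}(\theta)} - \mathbb{P}_{Z_{tk}(\theta')}\|. \]

So there exists a sequence $r_k(\theta,\theta') = o(1)$ such that

\[ \forall k \in \mathbb{N},\quad \|\pi_\theta - \pi_{\theta'}\| \le \|\mathbb{P}_{Z_{tk}(\theta)} - \mathbb{P}_{Z_{tk}(\theta')}\| + r_k(\theta,\theta'). \]

Now let $\gamma > 0$ and let $\Theta_\gamma$ be a minimal $\gamma$-cover of $\Theta$. Define the quantity
$r_k(\gamma) = \max_{\theta,\theta' \in \Theta_\gamma} r_k(\theta,\theta')$. Then for any $\theta,\theta' \in \Theta$, let
$\theta_\gamma = \operatorname{argmin}_{\theta'' \in \Theta_\gamma} \|\pi_\theta - \pi_{\theta''}\|$ and
$\theta'_\gamma = \operatorname{argmin}_{\theta'' \in \Theta_\gamma} \|\pi_{\theta'} - \pi_{\theta''}\|$. Then a triangle inequality implies that $\forall k \in \mathbb{N}$,

\[
\begin{aligned}
\|\pi_\theta - \pi_{\theta'}\|
&\le \|\pi_\theta - \pi_{\theta_\gamma}\| + \|\pi_{\theta_\gamma} - \pi_{\theta'_\gamma}\| + \|\pi_{\theta'_\gamma} - \pi_{\theta'}\| \\
&< 2\gamma + r_k(\theta_\gamma,\theta'_\gamma) + \|\mathbb{P}_{Z_{tk}(\theta_\gamma)} - \mathbb{P}_{Z_{tk}(\theta'_\gamma)}\|
\le 2\gamma + r_k(\gamma) + \|\mathbb{P}_{Z_{tk}(\theta_\gamma)} - \mathbb{P}_{Z_{tk}(\theta'_\gamma)}\|.
\end{aligned}
\]

Triangle inequalities and the left inequality from the lemma statement (already
established) imply

\[
\begin{aligned}
\|\mathbb{P}_{Z_{tk}(\theta_\gamma)} - \mathbb{P}_{Z_{tk}(\theta'_\gamma)}\|
&\le \|\mathbb{P}_{Z_{tk}(\theta_\gamma)} - \mathbb{P}_{Z_{tk}(\theta)}\| + \|\mathbb{P}_{Z_{tk}(\theta)} - \mathbb{P}_{Z_{tk}(\theta')}\| + \|\mathbb{P}_{Z_{tk}(\theta'_\gamma)} - \mathbb{P}_{Z_{tk}(\theta')}\| \\
&\le \|\pi_{\theta_\gamma} - \pi_\theta\| + \|\mathbb{P}_{Z_{tk}(\theta)} - \mathbb{P}_{Z_{tk}(\theta')}\| + \|\pi_{\theta'_\gamma} - \pi_{\theta'}\|
< 2\gamma + \|\mathbb{P}_{Z_{tk}(\theta)} - \mathbb{P}_{Z_{tk}(\theta')}\|.
\end{aligned}
\]

So in total we have

\[ \|\pi_\theta - \pi_{\theta'}\| \le 4\gamma + r_k(\gamma) + \|\mathbb{P}_{Z_{tk}(\theta)} - \mathbb{P}_{Z_{tk}(\theta')}\|. \]

Since this holds for all $\gamma > 0$, defining $r_k = \inf_{\gamma > 0}(4\gamma + r_k(\gamma))$, we have the right
inequality of the lemma statement. Furthermore, since each $r_k(\theta,\theta') = o(1)$, and
$|\Theta_\gamma| < \infty$, we have $r_k(\gamma) = o(1)$ for each $\gamma > 0$, and thus we also have $r_k = o(1)$.

□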

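Lemma 3. $\forall t,k \in \mathbb{N}$, there exists a monotone function $M_k(x) = o(1)$ such that,
$\forall \theta,\theta' \in \Theta$,

\[ \|\mathbb{P}_{Z_{tk}(\theta)} - \mathbb{P}_{Z_{tk}(\theta')}\| \le M_k\big( \|\mathbb{P}_{Z_{td}(\theta)} - \mathbb{P}_{Z_{td}(\theta')}\| \big). \]

Proof. Fix any $t \in \mathbb{N}$, and let $\mathbb{X} = \{X_{t1}, X_{t2}, \ldots\}$ and $\mathbb{Y}(\theta) = \{Y_{t1}(\theta), Y_{t2}(\theta), \ldots\}$,
and for $k \in \mathbb{N}$ let $\mathbb{X}_k = \{X_{t1},\ldots,X_{tk}\}$ and $\mathbb{Y}_k(\theta) = \{Y_{t1}(\theta),\ldots,Y_{tk}(\theta)\}$.

If $k \le d$, then $\mathbb{P}_{Z_{tk}(\theta)}(\cdot) = \mathbb{P}_{Z_{td}(\theta)}(\cdot \times (\mathcal{X} \times \{-1,+1\})^{d-k})$, so that

\[ \|\mathbb{P}_{Z_{tk}(\theta)} - \mathbb{P}_{Z_{tk}(\theta')}\| \le \|\mathbb{P}_{Z_{td}(\theta)} - \mathbb{P}_{Z_{td}(\theta')}\|, \]

and therefore the result trivially holds.

Now suppose $k > d$. Fix any $\gamma > 0$, and let $B_{\theta,\theta'} \subseteq (\mathcal{X} \times \mathbb{R})^k$ be a measurable
set such that

\[ \|\mathbb{P}_{Z_{tk}(\theta)} - \mathbb{P}_{Z_{tk}(\theta')}\| < \mathbb{P}_{Z_{tk}(\theta)}(B_{\theta,\theta'}) - \mathbb{P}_{Z_{tk}(\theta')}(B_{\theta,\theta'}) + \gamma. \]

By Carathéodory's extension theorem (specifically, the version presented by [8]),
there exists a disjoint sequence of sets $\{B_i(\theta,\theta')\}_{i=1}^{\infty}$ such that

\[ \mathbb{P}_{Z_{tk}(\theta)}(B_{\theta,\theta'}) - \mathbb{P}_{Z_{tk}(\theta')}(B_{\theta,\theta'}) < \gamma + \sum_{i=1}^{\infty} \mathbb{P}_{Z_{tk}(\theta)}(B_i(\theta,\theta')) - \sum_{i=1}^{\infty} \mathbb{P}_{Z_{tk}(\theta')}(B_i(\theta,\theta')), \]

and such that each $B_i(\theta,\theta')$ is representable as follows; for some $\ell_i(\theta,\theta') \in \mathbb{N}$, and
sets $C_{ij} = (A_{ij1} \times (-\infty,t_{ij1}]) \times \cdots \times (A_{ijk} \times (-\infty,t_{ijk}])$, for $j \le \ell_i(\theta,\theta')$, where
each $A_{ijp} \in \mathcal{B}_{\mathcal{X}}$, the set $B_i(\theta,\theta')$ is representable as $\bigcup_{s \in S_i} \bigcap_{j=1}^{\ell_i(\theta,\theta')} D_{ijs}$, where
$S_i \subseteq \{0,\ldots,2^{\ell_i(\theta,\theta')} - 1\}$, each $D_{ijs} \in \{C_{ij}, C_{ij}^c\}$, and
$s \ne s' \implies \bigcap_{j} D_{ijs} \cap \bigcap_{j} D_{ijs'} = \emptyset$. Since the $B_i(\theta,\theta')$ are disjoint, the above sums are bounded,
so that there exists $m_k(\theta,\theta',\gamma) \in \mathbb{N}$ such that every $m \ge m_k(\theta,\theta',\gamma)$ has

\[ \mathbb{P}_{Z_{tk}(\theta)}(B_{\theta,\theta'}) - \mathbb{P}_{Z_{tk}(\theta')}(B_{\theta,\theta'}) < 2\gamma + \sum_{i=1}^{m} \mathbb{P}_{Z_{tk}(\theta)}(B_i(\theta,\theta')) - \sum_{i=1}^{m} \mathbb{P}_{Z_{tk}(\theta')}(B_i(\theta,\theta')). \]

Now define $\tilde{M}_k(\gamma) = \max_{\theta,\theta' \in \Theta_\gamma} m_k(\theta,\theta',\gamma)$. Then for any $\theta,\theta' \in \Theta$, let $\theta_\gamma, \theta'_\gamma \in \Theta_\gamma$
be such that $\|\pi_\theta - \pi_{\theta_\gamma}\| < \gamma$ and $\|\pi_{\theta'} - \pi_{\theta'_\gamma}\| < \gamma$, which implies
$\|\mathbb{P}_{Z_{tk}(\theta)} - \mathbb{P}_{Z_{tk}(\theta_\gamma)}\| < \gamma$ and $\|\mathbb{P}_{Z_{tk}(\theta')} - \mathbb{P}_{Z_{tk}(\theta'_\gamma)}\| < \gamma$ by Lemma 2. Then

\[
\begin{aligned}
\|\mathbb{P}_{Z_{tk}(\theta)} - \mathbb{P}_{Z_{tk}(\theta')}\|
&< \|\mathbb{P}_{Z_{tk}(\theta_\gamma)} - \mathbb{P}_{Z_{tk}(\theta'_\gamma)}\| + 2\gamma \\
&\le \mathbb{P}_{Z_{tk}(\theta_\gamma)}(B_{\theta_\gamma,\theta'_\gamma}) - \mathbb{P}_{Z_{tk}(\theta'_\gamma)}(B_{\theta_\gamma,\theta'_\gamma}) + 3\gamma \\
&\le \sum_{i=1}^{\tilde{M}_k(\gamma)} \Big( \mathbb{P}_{Z_{tk}(\theta_\gamma)}(B_i(\theta_\gamma,\theta'_\gamma)) - \mathbb{P}_{Z_{tk}(\theta'_\gamma)}(B_i(\theta_\gamma,\theta'_\gamma)) \Big) + 5\gamma.
\end{aligned}
\]

Again, since the $B_i(\theta_\gamma,\theta'_\gamma)$ are disjoint, this equals

\[
\begin{aligned}
& 5\gamma + \mathbb{P}_{Z_{tk}(\theta_\gamma)}\Big( \bigcup_{i=1}^{\tilde{M}_k(\gamma)} B_i(\theta_\gamma,\theta'_\gamma) \Big) - \mathbb{P}_{Z_{tk}(\theta'_\gamma)}\Big( \bigcup_{i=1}^{\tilde{M}_k(\gamma)} B_i(\theta_\gamma,\theta'_\gamma) \Big) \\
&\le 7\gamma + \mathbb{P}_{Z_{tk}(\theta)}\Big( \bigcup_{i=1}^{\tilde{M}_k(\gamma)} B_i(\theta_\gamma,\theta'_\gamma) \Big) - \mathbb{P}_{Z_{tk}(\theta')}\Big( \bigcup_{i=1}^{\tilde{M}_k(\gamma)} B_i(\theta_\gamma,\theta'_\gamma) \Big) \\
&= 7\gamma + \sum_{i=1}^{\tilde{M}_k(\gamma)} \Big( \mathbb{P}_{Z_{tk}(\theta)}(B_i(\theta_\gamma,\theta'_\gamma)) - \mathbb{P}_{Z_{tk}(\theta')}(B_i(\theta_\gamma,\theta'_\gamma)) \Big) \\
&\le 7\gamma + \tilde{M}_k(\gamma) \max_{i \le \tilde{M}_k(\gamma)} \Big| \mathbb{P}_{Z_{tk}(\theta)}(B_i(\theta_\gamma,\theta'_\gamma)) - \mathbb{P}_{Z_{tk}(\theta')}(B_i(\theta_\gamma,\theta'_\gamma)) \Big|.
\end{aligned}
\]

Thus, if we can show that each term $|\mathbb{P}_{Z_{tk}(\theta)}(B_i(\theta_\gamma,\theta'_\gamma)) - \mathbb{P}_{Z_{tk}(\theta')}(B_i(\theta_\gamma,\theta'_\gamma))|$
is bounded by a $o(1)$ function of $\|\mathbb{P}_{Z_{td}(\theta)} - \mathbb{P}_{Z_{td}(\theta')}\|$, then the result will follow
by substituting this relaxation into the above expression and defining $M_k$ by
minimizing the resulting expression over $\gamma > 0$.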

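Toward this end, let $C_{ij}$ be as above from the definition of $B_i(\theta_\gamma,\theta'_\gamma)$, and
note that $I_{B_i(\theta_\gamma,\theta'_\gamma)}$ is representable as a function of the $I_{C_{ij}}$ indicators, so that,
writing $\ell_i = \ell_i(\theta_\gamma,\theta'_\gamma)$,

\[
\begin{aligned}
& \Big| \mathbb{P}_{Z_{tk}(\theta)}(B_i(\theta_\gamma,\theta'_\gamma)) - \mathbb{P}_{Z_{tk}(\theta')}(B_i(\theta_\gamma,\theta'_\gamma)) \Big| \\
&\le 2^{\ell_i} \max_{J \subseteq \{1,\ldots,\ell_i\}} \Big| \mathbb{E}\Big[ \Big(\prod_{j \in J} I_{C_{ij}}(Z_{tk}(\theta))\Big) \prod_{j \notin J} \big(1 - I_{C_{ij}}(Z_{tk}(\theta))\big) - \Big(\prod_{j \in J} I_{C_{ij}}(Z_{tk}(\theta'))\Big) \prod_{j \notin J} \big(1 - I_{C_{ij}}(Z_{tk}(\theta'))\big) \Big] \Big| \\
&\le 4^{\ell_i} \max_{J \subseteq \{1,\ldots,\ell_i\}} \Big| \mathbb{E}\Big[ \prod_{j \in J} I_{C_{ij}}(Z_{tk}(\theta)) - \prod_{j \in J} I_{C_{ij}}(Z_{tk}(\theta')) \Big] \Big| \\
&= 4^{\ell_i} \max_{J \subseteq \{1,\ldots,\ell_i\}} \Big| \mathbb{P}_{Z_{tk}(\theta)}\Big( \bigcap_{j \in J} C_{ij} \Big) - \mathbb{P}_{Z_{tk}(\theta')}\Big( \bigcap_{j \in J} C_{ij} \Big) \Big|.
\end{aligned}
\]

Note that $\bigcap_{j \in J} C_{ij}$ can be expressed as some $(A_1 \times (-\infty,t_1]) \times \cdots \times (A_k \times (-\infty,t_k])$,
where each $A_p \in \mathcal{B}_{\mathcal{X}}$ and $t_p \in \mathbb{R}$, so that, for $\ell = \max_{\theta,\theta' \in \Theta_\gamma} \max_{i \le \tilde{M}_k(\gamma)} \ell_i(\theta,\theta')$
and $\mathcal{C}_k = \{(A_1 \times (-\infty,t_1]) \times \cdots \times (A_k \times (-\infty,t_k]) : \forall j \le k, A_j \in \mathcal{B}_{\mathcal{X}}, t_j \in \mathbb{R}\}$,
this last expression is at most

\[ 4^{\ell} \sup_{C \in \mathcal{C}_k} \Big| \mathbb{P}_{Z_{tk}(\theta)}(C) - \mathbb{P}_{Z_{tk}(\theta')}(C) \Big|. \]

Next note that for any $C = (A_1 \times (-\infty,t_1]) \times \cdots \times (A_k \times (-\infty,t_k]) \in \mathcal{C}_k$, letting
$C_1 = A_1 \times \cdots \times A_k$ and $C_2 = (-\infty,t_1] \times \cdots \times (-\infty,t_k]$,

\[
\begin{aligned}
\mathbb{P}_{Z_{tk}(\theta)}(C) - \mathbb{P}_{Z_{tk}(\theta')}(C)
&= \mathbb{E}\Big[ \big( \mathbb{P}_{\mathbb{Y}_k(\theta)|\mathbb{X}_k}(C_2) - \mathbb{P}_{\mathbb{Y}_k(\theta')|\mathbb{X}_k}(C_2) \big) I_{C_1}(\mathbb{X}_k) \Big] \\
&\le \mathbb{E}\Big[ \big| \mathbb{P}_{\mathbb{Y}_k(\theta)|\mathbb{X}_k}(C_2) - \mathbb{P}_{\mathbb{Y}_k(\theta')|\mathbb{X}_k}(C_2) \big| \Big].
\end{aligned}
\]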

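For $p \in \{1,\ldots,k\}$, let $C_{2p} = (-\infty,t_p]$. Then note that, by definition of $d$, for
any given $x = (x_1,\ldots,x_k)$, the class $\mathcal{H}_x = \{x_p \mapsto I_{C_{2p}}(h(x_p)) : h \in \mathcal{F}\}$ is a VC
class over $\{x_1,\ldots,x_k\}$ with VC dimension at most $d$. Furthermore, we have

\[
\begin{aligned}
& \Big| \mathbb{P}_{\mathbb{Y}_k(\theta)|\mathbb{X}_k}(C_2) - \mathbb{P}_{\mathbb{Y}_k(\theta')|\mathbb{X}_k}(C_2) \Big| \\
&= \Big| \mathbb{P}_{(I_{C_{21}}(h^*_{t\theta}(X_{t1})),\ldots,I_{C_{2k}}(h^*_{t\theta}(X_{tk})))|\mathbb{X}_k}(\{(1,\ldots,1)\}) - \mathbb{P}_{(I_{C_{21}}(h^*_{t\theta'}(X_{t1})),\ldots,I_{C_{2k}}(h^*_{t\theta'}(X_{tk})))|\mathbb{X}_k}(\{(1,\ldots,1)\}) \Big|.
\end{aligned}
\]

Therefore, the results of [12] (in the proof of their Lemma 3) imply that

\[
\begin{aligned}
& \Big| \mathbb{P}_{\mathbb{Y}_k(\theta)|\mathbb{X}_k}(C_2) - \mathbb{P}_{\mathbb{Y}_k(\theta')|\mathbb{X}_k}(C_2) \Big| \\
&\le 2^k \max_{y \in \{0,1\}^d} \max_{D \in \{1,\ldots,k\}^d} \Big| \mathbb{P}_{\{I_{C_{2j}}(h^*_{t\theta}(X_{tj}))\}_{j \in D} | \{X_{tj}\}_{j \in D}}(\{y\}) - \mathbb{P}_{\{I_{C_{2j}}(h^*_{t\theta'}(X_{tj}))\}_{j \in D} | \{X_{tj}\}_{j \in D}}(\{y\}) \Big|.
\end{aligned}
\]

Thus, we have

\[
\begin{aligned}
& \mathbb{E}\Big[ \big| \mathbb{P}_{\mathbb{Y}_k(\theta)|\mathbb{X}_k}(C_2) - \mathbb{P}_{\mathbb{Y}_k(\theta')|\mathbb{X}_k}(C_2) \big| \Big] \\
&\le 2^k \, \mathbb{E}\Big[ \max_{y \in \{0,1\}^d} \max_{D \in \{1,\ldots,k\}^d} \Big| \mathbb{P}_{\{I_{C_{2j}}(h^*_{t\theta}(X_{tj}))\}_{j \in D} | \{X_{tj}\}_{j \in D}}(\{y\}) - \mathbb{P}_{\{I_{C_{2j}}(h^*_{t\theta'}(X_{tj}))\}_{j \in D} | \{X_{tj}\}_{j \in D}}(\{y\}) \Big| \Big] \\
&\le 2^k \sum_{y \in \{0,1\}^d} \sum_{D \in \{1,\ldots,k\}^d} \mathbb{E}\Big[ \Big| \mathbb{P}_{\{I_{C_{2j}}(h^*_{t\theta}(X_{tj}))\}_{j \in D} | \{X_{tj}\}_{j \in D}}(\{y\}) - \mathbb{P}_{\{I_{C_{2j}}(h^*_{t\theta'}(X_{tj}))\}_{j \in D} | \{X_{tj}\}_{j \in D}}(\{y\}) \Big| \Big] \\
&\le 2^{d+k} k^d \max_{y \in \{0,1\}^d} \max_{D \in \{1,\ldots,k\}^d} \mathbb{E}\Big[ \Big| \mathbb{P}_{\{I_{C_{2j}}(h^*_{t\theta}(X_{tj}))\}_{j \in D} | \{X_{tj}\}_{j \in D}}(\{y\}) - \mathbb{P}_{\{I_{C_{2j}}(h^*_{t\theta'}(X_{tj}))\}_{j \in D} | \{X_{tj}\}_{j \in D}}(\{y\}) \Big| \Big].
\end{aligned}
\]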
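
Exchangeability implies this is at most

\[ 2^{d+k} k^d \max_{y \in \{0,1\}^d} \cdots \]

[remainder of this display not recoverable from the source]

[12] argue that for all $y \in \{0,1\}^d$ and $t_1,\ldots,t_d \in \mathbb{R}$,

[display not recoverable from the source]

Noting that

[display not recoverable from the source]

completes the proof.

□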


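We are now ready for the proof of Theorem 4.

Proof (Proof of Theorem 4). The estimator $\hat{\theta}_{T\theta_\star}$ we will use is precisely the
minimum-distance skeleton estimate of $\mathbb{P}_{Z_{td}(\theta_\star)}$ [13, 5]. [13] proved that if $N(\varepsilon)$
is the $\varepsilon$-covering number of $\{\mathbb{P}_{Z_{td}(\theta)} : \theta \in \Theta\}$, then taking this $\hat{\theta}_{T\theta_\star}$ estimator,
for some $T_\varepsilon = O((1/\varepsilon^2) \log N(\varepsilon/4))$, any $T \ge T_\varepsilon$ has

\[ \mathbb{E}\Big[ \|\mathbb{P}_{Z_{td}(\hat{\theta}_{T\theta_\star})} - \mathbb{P}_{Z_{td}(\theta_\star)}\| \Big] < \varepsilon. \]

Thus, taking $G_T = \inf\{\varepsilon > 0 : T \ge T_\varepsilon\}$, we have

\[ \mathbb{E}\Big[ \|\mathbb{P}_{Z_{td}(\hat{\theta}_{T\theta_\star})} - \mathbb{P}_{Z_{td}(\theta_\star)}\| \Big] \le G_T = o(1). \]

Letting $R'(T,\alpha)$ be any positive sequence with $G_T \ll R'(T,\alpha) \ll 1$ and $R'(T,\alpha) \ge
G_T/\alpha$, and letting $\delta(T,\alpha) = G_T/R'(T,\alpha) = o(1)$, Markov's inequality implies

\[ \mathbb{P}\Big( \|\mathbb{P}_{Z_{td}(\hat{\theta}_{T\theta_\star})} - \mathbb{P}_{Z_{td}(\theta_\star)}\| > R'(T,\alpha) \Big) \le \delta(T,\alpha) \le \alpha. \qquad (2) \]

Letting $R(T,\alpha) = \min_k \big( M_k(R'(T,\alpha)) + r_k \big)$, since $R'(T,\alpha) = o(1)$ and $r_k =
o(1)$, we have $R(T,\alpha) = o(1)$. Furthermore, composing (2) with Lemmas 1, 2,
and 3, we have

\[ \mathbb{P}\Big( \|\pi_{\hat{\theta}_{T\theta_\star}} - \pi_{\theta_\star}\| > R(T,\alpha) \Big) \le \delta(T,\alpha) \le \alpha. \]

□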
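
To make the estimator used in the proof more concrete, the following is a minimal sketch of a minimum-distance estimate over a finite skeleton of discrete distributions, using the Yatracos sets $\{z : p_i(z) > p_j(z)\}$ in the style of [13, 5]. The dictionary representation, the function name `minimum_distance_estimate`, and the restriction to finitely many discrete candidates (at least two) are assumptions for illustration; the proof applies the construction to an $\varepsilon$-cover of $\{\mathbb{P}_{Z_{td}(\theta)} : \theta \in \Theta\}$.

```python
from collections import Counter

def minimum_distance_estimate(skeleton, sample):
    """skeleton: dict name -> dict outcome -> probability (>= 2 candidates).
    Returns the candidate whose largest discrepancy with the empirical
    measure, over the Yatracos sets A_ij = {z : p_i(z) > p_j(z)}, is smallest."""
    counts = Counter(sample)
    n = len(sample)
    outcomes = set(counts)
    for p in skeleton.values():
        outcomes |= set(p)
    names = list(skeleton)
    yatracos_sets = [
        frozenset(z for z in outcomes
                  if skeleton[i].get(z, 0.0) > skeleton[j].get(z, 0.0))
        for i in names for j in names if i != j
    ]

    def discrepancy(name):
        p = skeleton[name]
        return max(
            abs(sum(p.get(z, 0.0) for z in A) - sum(counts[z] for z in A) / n)
            for A in yatracos_sets
        )

    return min(names, key=discrepancy)

# Hypothetical two-candidate skeleton over binary outcomes.
skeleton = {"fair": {0: 0.5, 1: 0.5}, "biased": {0: 0.9, 1: 0.1}}
choice = minimum_distance_estimate(skeleton, [0] * 88 + [1] * 12)
```

The number of Yatracos sets grows quadratically in the size of the skeleton, which is one reason this estimator is typically not computationally efficient for fine covers.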

Remark: Although the above proof makes use of the minimum-distance skeleton
estimator, which is typically not computationally efficient, it is often possible
to achieve this same result (for certain families of distributions) using a simpler
estimator, such as the maximum likelihood estimator. All we require is that the
risk of the estimator converges to 0 at a known rate that is independent of $\theta_\star$.
For instance, see [6] for conditions on the family of distributions sufficient for 
this to be true of the maximum likelihood estimator. 

References 

1. Bar-Yossef, Z.: Sampling lower bounds via information theory. In: Proceedings
of the 35th Annual ACM Symposium on the Theory of Computing, pp. 335-344 
(2003) 

2. Baxter, J.: A Bayesian/information theoretic model of learning to learn via multiple 
task sampling. Machine Learning 28, 7-39 (1997) 

3. Blumer, A., Ehrenfeucht, A., Haussler, D., Warmuth, M.: Learnability and the
Vapnik-Chervonenkis dimension. Journal of the Association for Computing Machinery
36(4), 929-965 (1989)

4. Cramton, P., Shoham, Y., Steinberg, R.: Combinatorial Auctions. The MIT Press 
(2006) 

5. Devroye, L., Lugosi, G.: Combinatorial Methods in Density Estimation. Springer, 
New York, NY, USA (2001) 

6. van de Geer, S.: Empirical Processes in M-Estimation. Cambridge University Press
(2000)

7. Poland, J., Hutter, M.: MDL convergence speed for Bernoulli sequences. Statistics 
and Computing 16, 161-175 (2006) 

8. Schervish, M.J.: Theory of Statistics. Springer, New York, NY, USA (1995) 

9. Vapnik, V.: Estimation of Dependencies Based on Empirical Data. Springer-Verlag, 
New York (1982) 

10. Vapnik, V., Chervonenkis, A.: On the uniform convergence of relative frequencies 
of events to their probabilities. Theory of Probability and its Applications 16, 
264-280 (1971) 

11. Wald, A.: Sequential tests of statistical hypotheses. The Annals of Mathematical 
Statistics 16(2), 117-186 (1945) 

12. Yang, L., Hanneke, S., Carbonell, J.: A theory of transfer learning with applications 
to active learning. Machine Learning 90(2), 161-189 (2013) 

13. Yatracos, Y.G.: Rates of convergence of minimum distance estimators and Kolmogorov's
entropy. The Annals of Statistics 13, 768-774 (1985)

14. Zinkevich, M., Blum, A., Sandholm, T.: On polynomial-time preference elicitation
with value queries. In: Proceedings of the 4th ACM Conference on Electronic
Commerce, pp. 175-185 (2003)


